pyspark.sql.functions.regexp_extract_all

pyspark.sql.functions.regexp_extract_all(str, regexp, idx=None)

Extract every substring of str that matches the Java regex regexp and return, for each match, the text captured by the regex group at index idx.

New in version 3.5.0.

Parameters

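str (Column or column name): target column to work on.
regexp (Column or column name): regex pattern to apply. A plain string is interpreted as a column name; wrap a literal pattern in lit().
idx (Column or int, optional): index of the regex group to extract. Group 1 is used when idx is omitted.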

Returns

Column: an array of strings holding, for each match of the Java regex in str, the text captured by the regex group at the given index.
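The per-row behavior described above can be approximated in plain Python. This is only a sketch: extract_all_sketch is a hypothetical helper, not part of PySpark, and it uses Python's re module, whose dialect differs from Java regex in some details. Treating index 0 as the whole match is an assumption taken from general Spark SQL behavior rather than from this page.

```python
import re


def extract_all_sketch(text, pattern, idx=1):
    """Hypothetical stand-in for regexp_extract_all applied to one string.

    Collects, for every non-overlapping match of `pattern` in `text`,
    the text captured by group `idx` (0 means the whole match).
    """
    return [m.group(idx) for m in re.finditer(pattern, text)]


text = "100-200, 300-400"
pattern = r"(\d+)-(\d+)"

print(extract_all_sketch(text, pattern))      # group 1 of each match
print(extract_all_sketch(text, pattern, 2))   # group 2 of each match
print(extract_all_sketch(text, pattern, 0))   # each whole match
```

With the input used in the examples on this page, group 1 yields the left-hand numbers and group 2 the right-hand numbers of each pair.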

See also

pyspark.sql.functions.regexp_extract()
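The difference from regexp_extract() can be sketched in plain Python. Both helpers below are hypothetical stand-ins written with Python's re module, not PySpark code. They assume that regexp_extract() returns only the first match (and an empty string when nothing matches), while regexp_extract_all() returns one array entry per match.

```python
import re


def first_match_sketch(text, pattern, idx):
    # Hypothetical analogue of regexp_extract: first match only,
    # empty string when the pattern does not match (assumed behavior).
    m = re.search(pattern, text)
    return m.group(idx) if m else ""


def all_matches_sketch(text, pattern, idx=1):
    # Hypothetical analogue of regexp_extract_all: one entry per match,
    # empty list when the pattern does not match.
    return [m.group(idx) for m in re.finditer(pattern, text)]


text = "100-200, 300-400"
pattern = r"(\d+)-(\d+)"

print(first_match_sketch(text, pattern, 1))  # a single string
print(all_matches_sketch(text, pattern, 1))  # a list of strings
```

The single-match variant suits columns with at most one occurrence per row; the all-matches variant suits rows that may contain several.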

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("100-200, 300-400", r"(\d+)-(\d+)")], ["str", "regexp"])
>>> # idx omitted: group 1 is extracted from each match
>>> df.select('*', sf.regexp_extract_all('str', sf.lit(r'(\d+)-(\d+)'))).show()
+----------------+-----------+---------------------------------------+
|             str|     regexp|regexp_extract_all(str, (\d+)-(\d+), 1)|
+----------------+-----------+---------------------------------------+
|100-200, 300-400|(\d+)-(\d+)|                             [100, 300]|
+----------------+-----------+---------------------------------------+

>>> # idx passed as a Column
>>> df.select('*', sf.regexp_extract_all('str', sf.lit(r'(\d+)-(\d+)'), sf.lit(1))).show()
+----------------+-----------+---------------------------------------+
|             str|     regexp|regexp_extract_all(str, (\d+)-(\d+), 1)|
+----------------+-----------+---------------------------------------+
|100-200, 300-400|(\d+)-(\d+)|                             [100, 300]|
+----------------+-----------+---------------------------------------+

>>> # idx passed as an int: group 2 is extracted from each match
>>> df.select('*', sf.regexp_extract_all('str', sf.lit(r'(\d+)-(\d+)'), 2)).show()
+----------------+-----------+---------------------------------------+
|             str|     regexp|regexp_extract_all(str, (\d+)-(\d+), 2)|
+----------------+-----------+---------------------------------------+
|100-200, 300-400|(\d+)-(\d+)|                             [200, 400]|
+----------------+-----------+---------------------------------------+

>>> # The pattern is read from the "regexp" column of each row
>>> df.select('*', sf.regexp_extract_all('str', sf.col("regexp"))).show()
+----------------+-----------+----------------------------------+
|             str|     regexp|regexp_extract_all(str, regexp, 1)|
+----------------+-----------+----------------------------------+
|100-200, 300-400|(\d+)-(\d+)|                        [100, 300]|
+----------------+-----------+----------------------------------+

>>> # A plain string for regexp is treated as a column name, not as a pattern
>>> df.select('*', sf.regexp_extract_all(sf.col('str'), "regexp")).show()
+----------------+-----------+----------------------------------+
|             str|     regexp|regexp_extract_all(str, regexp, 1)|
+----------------+-----------+----------------------------------+
|100-200, 300-400|(\d+)-(\d+)|                        [100, 300]|
+----------------+-----------+----------------------------------+