pyspark.SparkContext.binaryFiles

SparkContext.binaryFiles(path: str, minPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[str, bytes]]

Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content as a byte array.

New in version 1.3.0.

Parameters
path : str

directory containing the input data files; several input paths can be passed as one comma-separated string

minPartitions : int, optional

suggested minimum number of partitions for the resulting RDD
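
As a rough illustration of the two parameters above, the sketch below joins several directories into one comma-separated path string and forwards a partition hint. The helper name read_binary_dirs and the example directories are hypothetical and not part of PySpark; sc is assumed to be an existing SparkContext.

```python
from typing import Any, Optional, Sequence


def read_binary_dirs(sc: Any, dirs: Sequence[str], min_partitions: Optional[int] = None) -> Any:
    """Read several directories with a single binaryFiles call.

    Hypothetical helper: `sc` is assumed to be an active SparkContext.
    """
    # binaryFiles accepts multiple inputs as one comma-separated string.
    path = ",".join(dirs)
    # minPartitions is only a suggestion for the resulting RDD.
    return sc.binaryFiles(path, minPartitions=min_partitions)
```

With a running context, a call such as read_binary_dirs(sc, ["/data/a", "/data/b"], 4) would be expected to return one RDD of (path, content) pairs covering both directories.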

Returns
RDD

RDD representing path-content pairs from the file(s).

Notes

Small files are preferred. Large files are also allowed, but may cause poor performance because each file is read as a single record.
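
One possible way to act on this note, for inputs on a local file system only, is to look for unusually large files before reading. The sketch below is not part of PySpark; the helper name find_large_files and the size limit are assumptions chosen for illustration.

```python
import os


def find_large_files(directory, limit_bytes):
    """Return sorted (name, size) pairs for regular files larger than limit_bytes.

    Hypothetical pre-check for a local directory; the limit is an arbitrary choice.
    """
    large = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                size = entry.stat().st_size
                if size > limit_bytes:
                    large.append((entry.name, size))
    return sorted(large)
```

Files reported by such a check could be split or handled separately before calling binaryFiles on the directory.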

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write a temporary binary file
...     with open(os.path.join(d, "1.bin"), "wb") as f1:
...         _ = f1.write(b"binary data I")
...
...     # Write another temporary binary file
...     with open(os.path.join(d, "2.bin"), "wb") as f2:
...         _ = f2.write(b"binary data II")
...
...     collected = sorted(sc.binaryFiles(d).collect())
>>>
>>> collected
[('.../1.bin', b'binary data I'), ('.../2.bin', b'binary data II')]
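
The path-content pairs can be processed like any other RDD records. The following is a minimal sketch, assuming sc is an existing SparkContext; the function names to_name_and_size and summarize_binary_files are hypothetical and only illustrate mapping over the pairs.

```python
import posixpath


def to_name_and_size(pair):
    """Turn one (path, content) pair into (file name, size in bytes)."""
    path, content = pair
    # Keys are file paths or URIs, so take the last path component.
    return posixpath.basename(path), len(content)


def summarize_binary_files(sc, path):
    """Hypothetical helper: collect (file name, size) pairs for every file under path."""
    return sorted(sc.binaryFiles(path).map(to_name_and_size).collect())
```

Applied to the two files written in the example above, to_name_and_size would yield ('1.bin', 13) and ('2.bin', 14).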