
Python Write To Hdfs File

What is the best way to create/write/update a file in remote HDFS from a local Python script? I am able to list files and directories, but writing seems to be a problem. I have searched but haven't found a working approach.

Solution 1:

Try the hdfs library; it's really good. You can use write(). https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write

Example:

To create a connection:

from hdfs import InsecureClient
client = InsecureClient('http://host:port', user='ann')

from json import dump, dumps
records = [
  {'name': 'foo', 'weight': 1},
  {'name': 'bar', 'weight': 2},
]

# As a context manager:
with client.write('data/records.jsonl', encoding='utf-8') as writer:
  dump(records, writer)

# Or, passing in a generator directly:
client.write('data/records.jsonl', data=dumps(records), encoding='utf-8')

For CSV you can do

import pandas as pd

df = pd.read_csv("file.csv")
with client.write('path/output.csv', encoding='utf-8') as writer:
  df.to_csv(writer)
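
Reading it back into a dataframe works the same way with read() (a sketch reusing the client and path from above):

# Read the CSV back from HDFS into a dataframe
with client.read('path/output.csv', encoding='utf-8') as reader:
  df = pd.read_csv(reader)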

Solution 2:

What's wrong with other answers

They use WebHDFS, which is not enabled by default and is insecure without Kerberos or Apache Knox.

This is what the upload function of that hdfs library you linked to uses.

Native (more secure) ways to write to HDFS using Python

You can use pyspark.

Example: How to write a pyspark dataframe to HDFS and then how to read it back into a dataframe?
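
A minimal sketch of that round trip (assuming a Spark session already configured against your cluster; the namenode address and path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hdfs-write').getOrCreate()

# Write a small dataframe to HDFS as Parquet (the path is hypothetical)
df = spark.createDataFrame([('foo', 1), ('bar', 2)], ['name', 'weight'])
df.write.mode('overwrite').parquet('hdfs://namenode:8020/data/records')

# Read it back into a dataframe
df2 = spark.read.parquet('hdfs://namenode:8020/data/records')
df2.show()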


snakebite has been mentioned, but it doesn't write files


pyarrow has a FileSystem.open() function that should be able to write to HDFS as well, though I've not tried it.
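
An untested sketch with pyarrow's newer filesystem API, where the equivalent of open() is open_output_stream() (host, port, and path are placeholders, and libhdfs from a local Hadoop installation is required):

from pyarrow import fs

# Connect via libhdfs (JNI); HADOOP_HOME and CLASSPATH must be set locally
hdfs = fs.HadoopFileSystem('namenode', port=8020)

# Stream bytes to a new HDFS file
with hdfs.open_output_stream('/data/hello.txt') as f:
    f.write(b'hello from pyarrow\n')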

Solution 3:

Without using a complicated library built for HDFS, you can also simply use the requests package in Python against the WebHDFS REST API:

import requests
from json import dumps

params = {'op': 'CREATE', 'overwrite': 'true'}
data = dumps(file)  # some file or object - also tested with the pickle library
# requests follows the namenode's 307 redirect to a datanode automatically
response = requests.put('http://host:port/webhdfs/v1/path', params=params, data=data)

If the response status is 201 (Created), the file was written and your connection is working! This technique lets you use all the operations exposed by Hadoop's RESTful API: LISTSTATUS, MKDIRS, OPEN, DELETE, and so on.
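
For example, listing a directory is just a GET with a different op (same placeholder host and path as above):

# List a directory via WebHDFS (LISTSTATUS)
response = requests.get('http://host:port/webhdfs/v1/path', params={'op': 'LISTSTATUS'})
print(response.json()['FileStatuses']['FileStatus'])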

You can also convert curl commands to Python through these:

  1. Get Command for HDFS: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
  2. Convert to python: https://curl.trillworks.com/

Hope this helps!
