So, you have just landed on a big data project. Everybody knows how to use HDFS except you. All the data sits in a huge cluster and you don't know how to access it.
You are not really into graphical interfaces, so you don't really enjoy Cloudera Workbench, HUE, Zeppelin or whichever other tools your company has given you. Don't Panic!
Here are some quick tips and tricks for using your new environment from the terminal.
Create a directory in HDFS from the terminal:
hdfs dfs -mkdir hdfs://path
List a directory from the terminal:
hdfs dfs -ls hdfs://path
I'm sure at this point you have already got the idea: it's hdfs dfs -(Linux command) path for pretty much anything you want to do in HDFS.
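For example, a few of the ones you will probably use most often (the file names and paths below are just placeholders):

hdfs dfs -put local_file.csv hdfs://path        # upload a local file to HDFS
hdfs dfs -get hdfs://path/file.csv .            # download a file from HDFS to the current directory
hdfs dfs -cat hdfs://path/file.csv | head       # peek at the first lines of a file
hdfs dfs -rm -r hdfs://path                     # delete a directory recursively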
See https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html for other HDFS DFS terminal commands.
Check the size of an HDFS directory in GB:
sudo -u hdfs hadoop fs -du /hdfs_path | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
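If you just want it human-readable and sorted by size, a simpler variation (assuming your Hadoop and sort versions support the -h flag):

hdfs dfs -du -h /hdfs_path | sort -h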
Is this getting too technical for you?
At the end of the day everything depends on the technologies and languages you use. As an analyst, that will be Hive, Impala, R or PySpark 90% of the time. Probably for data cleaning and debugging; hopefully for more interesting stuff…
In that case you may not even need the HDFS commands listed above most of the time, let alone Pig, Sqoop or other data engineering tools… just typing your PySpark or Hive alias will be more than enough! For the rest of you, the techie ones, see below some ideas to play around with in your cluster.
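As a rough sketch, these are the kind of aliases you could keep in your ~/.bashrc; the host, port and Spark options here are made up, so adapt them to whatever your cluster actually uses:

# Hypothetical examples; replace host, port and options with your cluster's values
alias hive='beeline -u "jdbc:hive2://your-hiveserver:10000/default"'
alias pyspark2='pyspark --master yarn --deploy-mode client --num-executors 4'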
Kill a job running under your session:
yarn application -kill app_id    # e.g. application_1583734825_1310
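If you do not remember the application id, you can list what is currently running and filter by your user first:

yarn application -list -appStates RUNNING | grep $USER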
See application logs in terminal:
yarn logs -applicationId application_1583734825_1310
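These logs can get very long, so it usually pays off to pipe them through grep or less, for example:

yarn logs -applicationId application_1583734825_1310 | grep -i error | less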
Kill processes running in the background under your username
ps -ef | grep username
kill -9 <list of PIDs>
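If you prefer a single line, the same two steps can be chained (be careful: this kills every process whose command line matches the pattern):

ps -ef | grep username | grep -v grep | awk '{print $2}' | xargs kill -9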
Count the processes (and threads) running under your username
ps -L -u username | wc -l
Run it under 'watch' to auto-refresh it in the terminal
watch -n 0.5 "ps -f -L -u username | wc -l"
Create a loop with a sleep to keep your terminal session open (if it closes due to inactivity)
Create the file
vim awake_me
Write the code
#!/bin/bash
while true
do
  echo "."
  sleep 1800
  # exit if the terminal this script writes to has gone away
  ls -al /proc/$$/fd/1 | grep -q deleted && exit
done
Make it executable and run it in the background
chmod +x awake_me
./awake_me &
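When you no longer need it, stop it from the same terminal (assuming it is your only background job, %1 points at it):

jobs          # list background jobs
kill %1       # or: pkill -f awake_me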
Use PIG to merge a ton of text files in HDFS, …
-- PIG SCRIPT -- Instructions to export a file to an HDFS path
-- 1 Run the next line in the terminal: vim filename.pig
-- 2 Press 'i' for insert mode.
-- 3 Paste the script below, using your export path

set default_parallel 1;
set pig.exec.reducers.max=1;
-- pig.exec.reducers.bytes.per.reducer defines the number of input bytes per reducer; default value is 1000*1000*1000 (1GB).

pig_object = LOAD 'hdfs://path' USING AvroStorage() PARALLEL 1;
STORE pig_object INTO 'hdfs://path' USING PigStorage('|') PARALLEL 1;

-- 4 Press ESC to go back to normal mode
-- 5 Type :wq! to save, close and overwrite at the same time.
-- 6 Run the script in Pig using the line below
--      pig filename.pig
-- 7 Move the output to the Landing Zone with the line below
--      hadoop fs -get hdfs://path
-- 8 Copy the file headers to that location as well
-- END OF NORMAL PROCESS
… do it with the PIG fs functions,
-- 1 Start Pig
pig
-- 2 Run the merge function
fs -getmerge 'hdfs://path1' 'hdfs://path2'
…or do it directly by calling Hadoop from the shell (outside of PIG)
hadoop fs -getmerge 'hdfs://path1' 'hdfs://path2/output.csv'
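Note that getmerge writes the concatenated result to the local filesystem. If you also exported the file headers (step 8 of the Pig script above), a plain cat is enough to stitch everything together; the file names here are just made up:

cat headers.csv output.csv > final_output.csv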
Search in HDFS using Solr
Searching in terminal is normally as easy as:
find . -type f -print | xargs grep "example"
But when it comes to a big data cluster where you have terabytes (if not petabytes) of data, you need to do something smarter. Why not use Solr for this? The command below will turn Hadoop into your search butler!
export hdfs_find_location_1="/searching_path/*";
for var in `hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.HdfsFindTool \
    -find ${hdfs_find_location_1} -type f -name '*string_to_search*'`
do
  echo ${var}
done
Hopefully now you are ready to go!
This was only the first article in a series of tips, tricks and cheat sheets, all based on my experience over the last couple of years. For any query, correction or contribution, comments are really welcome 🙂
Stay connected