So, you have just landed on a big data project. Everybody knows how to use HDFS except you. All the data sits in a huge cluster and you don't know how to access it.
You are not really into graphical interfaces, so you don't really enjoy Cloudera Workbench, HUE, Zeppelin or whichever other tools your company has given you. Don't Panic!
Here are some quick tips and tricks for using your new environment from the terminal.
Create a directory in HDFS from the terminal:
hdfs dfs -mkdir hdfs://path
List a directory from the terminal:
hdfs dfs -ls hdfs://path
I'm sure at this point you have already got the idea: it's hdfs dfs -(Linux command) path for pretty much anything you want to do in HDFS.
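For example, a few of the ones you will probably use most often (the file names and paths below are just placeholders):

hdfs dfs -put local_file.csv hdfs://path        # upload a local file to HDFS
hdfs dfs -get hdfs://path/file.csv .            # download a file from HDFS to the current directory
hdfs dfs -cat hdfs://path/file.csv | head       # peek at the first lines of a file
hdfs dfs -rm -r hdfs://path                     # delete a directory recursively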
See https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html for other HDFS DFS terminal commands.
Check the size of an HDFS directory in GB:
sudo -u hdfs hadoop fs -du /hdfs_path | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
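If you just want it human-readable and sorted by size, a simpler variation (assuming your Hadoop and sort versions support the -h flag):

hdfs dfs -du -h /hdfs_path | sort -h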
Is this getting too technical for you?
At the end of the day everything depends on the technologies and languages you use. As an analyst, that will be Hive, Impala, R or PySpark 90% of the time. Probably for data cleaning and debugging; hopefully for more interesting stuff…
In that case you may not even need the HDFS commands listed above most of the time, let alone Pig, Sqoop or other data engineering tools… just typing your PySpark or Hive alias will be more than enough! For the rest of you, the techie ones, see below some ideas to play around with in your cluster.
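As a rough sketch, these are the kind of aliases you could keep in your ~/.bashrc; the host, port and Spark options here are made up, so adapt them to whatever your cluster actually uses:

# Hypothetical examples; replace host, port and options with your cluster's values
alias hive='beeline -u "jdbc:hive2://your-hiveserver:10000/default"'
alias pyspark2='pyspark --master yarn --deploy-mode client --num-executors 4'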
Kill a job running under your session:
yarn application -kill app_id    # e.g. application_1583734825_1310
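If you do not remember the application id, you can list what is currently running and filter by your user first:

yarn application -list -appStates RUNNING | grep $USER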
See application logs in terminal:
yarn logs -applicationId application_1583734825_1310
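These logs can get very long, so it usually pays off to pipe them through grep or less, for example:

yarn logs -applicationId application_1583734825_1310 | grep -i error | less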
Kill processes running in the background under your username
ps -ef | grep username
kill -9 <list of PIDs>
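If you prefer a single line, the same two steps can be chained (be careful: this kills every process whose command line matches the pattern):

ps -ef | grep username | grep -v grep | awk '{print $2}' | xargs kill -9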
Count the processes (and threads) running under your username
ps -L -u username | wc -l
Run it under 'watch' to auto-refresh it in the terminal
watch -n 0.5 "ps -f -L -u username | wc -l"
Create a loop with a sleep to keep your terminal session open (if it closes due to inactivity)
Create the file
vim awake_me
Write the code
#!/bin/bash
while true
do
  echo "."
  sleep 1800
  # exit if the terminal this script writes to has gone away
  ls -al /proc/$$/fd/1 | grep -q deleted && exit
done
Make it executable and run it in the background
chmod +x awake_me
./awake_me &
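When you no longer need it, stop it from the same terminal (assuming it is your only background job, %1 points at it):

jobs          # list background jobs
kill %1       # or: pkill -f awake_me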
Use PIG to merge a ton of text files in HDFS, …
-- PIG SCRIPT -- Instructions to export a file to an HDFS path
-- 1 Run the next line in the terminal: vim filename.pig
-- 2 Press 'i' for insert mode.
-- 3 Paste the script below, using your export path

set default_parallel 1;
set pig.exec.reducers.max=1;
-- pig.exec.reducers.bytes.per.reducer defines the number of input bytes per reducer; default value is 1000*1000*1000 (1GB).

pig_object = LOAD 'hdfs://path' USING AvroStorage() PARALLEL 1;
STORE pig_object INTO 'hdfs://path' USING PigStorage('|') PARALLEL 1;

-- 4 Press ESC to go back to normal mode
-- 5 Type :wq! to save, close and overwrite at the same time.
-- 6 Run the script in Pig using the line below
--      pig filename.pig
-- 7 Move the output to the Landing Zone with the line below
--      hadoop fs -get hdfs://path
-- 8 Copy the file headers to that location as well
-- END OF NORMAL PROCESS
… do it with the PIG fs functions,
-- 1 Start Pig
pig
-- 2 Run the merge function
fs -getmerge 'hdfs://path1' 'hdfs://path2'
…or do it directly by calling Hadoop from the shell (outside of PIG)
hadoop fs -getmerge 'hdfs://path1' 'hdfs://path2/output.csv'
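Note that getmerge writes the concatenated result to the local filesystem. If you also exported the file headers (step 8 of the Pig script above), a plain cat is enough to stitch everything together; the file names here are just made up:

cat headers.csv output.csv > final_output.csv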
Search in HDFS using Solr
Searching in terminal is normally as easy as:
find . -type f -print | xargs grep "example"
But when it comes to a big data cluster where you have terabytes (if not petabytes) of data, you need to do something smarter. Why not use Solr for this? The command below will turn Hadoop into your search butler!
export hdfs_find_location_1="/searching_path/*";
for var in `hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.HdfsFindTool \
    -find ${hdfs_find_location_1} -type f -name '*string_to_search*'`
do
  echo ${var}
done
Hopefully now you are ready to go!
This was only the first article in a series of tips, tricks and cheat sheets, all based on my experience over the last couple of years. For any query, correction or contribution, comments are really welcome 🙂
Stay connected