
Working in a Big Data Project using the terminal

So, you have just landed in a big data project. Everybody knows how to use HDFS except you. All the data sits in a huge cluster and you don’t know how to access it.

You are not really into graphical interfaces, so you don’t particularly enjoy Cloudera Workbench, HUE, Zeppelin or whichever other tools your company happens to have. Don’t Panic!

Here are some quick tips and tricks for using your new environment through the terminal.

 

Create a directory in HDFS in terminal:
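For example (the path is just an illustration; replace it with your own project folder):

hdfs dfs -mkdir -p /user/your_user/my_new_dir    # -p also creates any missing parent folders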

List a directory in terminal:
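And to see what is in there (again, illustrative paths):

hdfs dfs -ls /user/your_user
hdfs dfs -ls -R /user/your_user/my_new_dir    # -R lists recursively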

I’m sure at this stage you have already got the idea: it’s hdfs dfs -<linux command> <path> to do pretty much anything in HDFS.
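A few more of the usual suspects, just to show the pattern (file and folder names are made up):

hdfs dfs -put local_file.csv /user/your_user/          # upload from the edge node
hdfs dfs -get /user/your_user/local_file.csv .         # download to the edge node
hdfs dfs -cat /user/your_user/local_file.csv | head    # peek at the first lines
hdfs dfs -rm -r /user/your_user/my_new_dir             # remove a folder recursively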

See https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html for other HDFS DFS terminal commands.

Check the size of an HDFS directory in GB:
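Something like this (the -h flag prints human-readable sizes, so large folders show up in GB or TB; the paths are illustrative):

hdfs dfs -du -s -h /user/your_user/my_project    # total size of the folder
hdfs dfs -du -h /user/your_user                  # size of each sub-folder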

Is this getting too technical for you?

At the end of the day, everything depends on the technology and languages that you use. As an analyst, this will be Hive, Impala, R or pySpark 90% of the time. Probably for data cleaning and debugging; hopefully for more interesting stuff…

In that case you may not even need to care about the HDFS commands listed above most of the time. No PIG, Sqoop or data engineering stuff… just typing your pySpark or Hive alias will be more than enough! For the rest of you, the techie ones, see below some ideas to play around with in your cluster.

 

Kill a job running under your session:
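Assuming the cluster runs on YARN, something along these lines should do it (the application id below is made up):

yarn application -list | grep $USER    # find your application id
yarn application -kill application_1234567890123_0042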

See application logs in terminal:
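Again assuming YARN, once you know the application id:

yarn logs -applicationId application_1234567890123_0042 | less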

Kill processes running in the background under your username
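A blunt sketch; be careful, as this kills every matching process that belongs to you (the 'spark' filter is only an example):

pkill -u $USER -f spark    # kill your processes whose command line matches 'spark'
# or find the PID first and kill it explicitly
ps -fu $USER | grep spark
kill -9 12345              # made-up PID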

Show processes running under your username
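For example, a full-format listing of everything running under your user:

ps -fu $USER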

Run it under ‘watch’ to auto-refresh in terminal
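Something like this refreshes the listing every 5 seconds (pick whatever interval you like):

watch -n 5 "ps -fu $USER"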

 

Create a loop with a sleep to keep your terminal session open (if it closes due to inactivity)

Create the file
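For instance (the script name is just an example):

touch ~/keep_alive.sh
chmod +x ~/keep_alive.sh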

Write the code
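A minimal sketch of what it could contain: an endless loop that prints a heartbeat every five minutes.

#!/bin/bash
# print something every 5 minutes so the session never looks idle
while true; do
  echo "still alive: $(date)"
  sleep 300
done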

Run it in the background
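Launch it in the background of your current shell, so the echoes keep landing on your terminal (which is what keeps the session busy):

~/keep_alive.sh &
# bring it back to the foreground with 'fg', or stop it with 'kill %1'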

 

Use PIG to merge a ton of text files in HDFS, …
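One way to do it from a Pig script (paths are illustrative): load everything with a wildcard and store it straight back out, letting Pig combine the small input splits into far fewer, larger part files.

-- merge_small_files.pig
files = LOAD '/user/your_user/many_small_files/*' USING PigStorage();
STORE files INTO '/user/your_user/merged_output' USING PigStorage();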

… do it with the PIG fs functions,
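The grunt shell exposes the Hadoop FsShell through the fs command, so you can also merge into a single local file and push it back (watch out: -getmerge pulls everything down to the node you are on, so you need the disk space there; paths are illustrative):

grunt> fs -getmerge /user/your_user/many_small_files /tmp/merged_local.txt
grunt> fs -put /tmp/merged_local.txt /user/your_user/merged.txt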

…or even do it in parallel by calling Hadoop (out of PIG)
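If the data is too big to pull down to one node, a trick worth trying is a tiny Hadoop Streaming job with cat as both mapper and reducer and a single reduce task, so the merge happens on the cluster. Treat the jar path as an assumption (it depends on your distribution), and note that lines come out shuffled and may gain a trailing tab.

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapreduce.job.reduces=1 \
  -input /user/your_user/many_small_files \
  -output /user/your_user/merged_parallel \
  -mapper cat \
  -reducer cat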

Search in HDFS using Solr

Searching in terminal is normally as easy as:
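Plain old grep, or its naive HDFS equivalent (pattern and paths are illustrative):

grep -rn "what_you_look_for" /some/local/path
hdfs dfs -cat /user/your_user/some_dir/* | grep "what_you_look_for"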

But when it comes to a Big Data cluster where you have terabytes (if not petabytes) of data, you need to do something smarter. Why not use Solr for this? The command below will turn Hadoop into your searching butler!
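On a Cloudera-style cluster this usually means the Cloudera Search MapReduceIndexerTool, a MapReduce job that indexes files sitting in HDFS into a Solr collection. Take everything below as a placeholder for your own setup: the jar path, the morphline file, the ZooKeeper quorum and the collection name all depend on your cluster.

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file my_morphline.conf \
  --output-dir hdfs://namenode:8020/tmp/solr_index_output \
  --zk-host zk01:2181/solr \
  --collection my_collection \
  --go-live \
  hdfs://namenode:8020/user/your_user/data_to_index

Once indexed, you can query the collection from the terminal as well:

curl "http://solr_host:8983/solr/my_collection/select?q=what_you_look_for&wt=json"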

 

Hopefully now you are ready to go!

This was only the first article in a series of tips, tricks and cheatsheets, all based on my experience over the last couple of years. For any query, correction or contribution, comments are really welcome 🙂
