Hadoop Performance Analysis on Raspberry Pi for DNA Sequence Alignment

Authors: Jaya Sena Turana, Heru Sukoco, Wisnu Ananta Kusuma
Journal: TELKOMNIKA, Vol. 14 No. 3 (2016)
DOI: 10.12928/TELKOMNIKA.v14i3.1886

---

Abstract

The rapid growth of electronic data has made storing and processing big data a challenge. Hadoop provides a distributed computing framework that runs on commodity hardware. This research analyzes the performance of a Hadoop cluster built from six Raspberry Pi Model B devices, using DNA sequence alignment with the Biodoop library as the workload.

Key results:

  • Average CPU usage: 73.08%
  • Average memory usage: 334.69 MB
  • Average job completion time: 19.89 minutes
  • Block distribution reduced CPU usage by 24.14%
  • Block distribution reduced memory usage by 8.49%
  • Block distribution increased processing time by 31.53%

---

1. Introduction

Big Data is characterized by:

  • Volume
  • Variety
  • Velocity

Hadoop provides:

  • HDFS (Hadoop Distributed File System)
  • MapReduce processing framework

Raspberry Pi was selected because:

  • Low cost
  • Low power consumption
  • Easy cluster deployment
  • Suitable for educational and research environments

The objective of this research is to analyze Hadoop cluster performance on Raspberry Pi for DNA sequence alignment.

---

2. Related Works

Several Raspberry Pi clusters have previously been developed:

| Cluster | Nodes |
|---|---|
| Iridis-Pi | 64 |
| Glasgow Raspberry Pi Cloud | 54 |
| Bolzano Raspberry Pi Cloud | 300 |
| HPC Cluster | 14 |

These studies demonstrated the feasibility of low-cost distributed computing.

---

3. Research Method

Hardware

  • 6 Raspberry Pi Model B
  • 1 NameNode
  • 5 DataNodes
  • Raspbian OS

Software

  • Apache Hadoop
  • HDFS
  • MapReduce
  • Biodoop
  • Ganglia Monitoring

DNA Datasets

| DNA Sequence | Length (bp) |
|---|---|
| Ancylostoma duodenale mitochondrion | 13,271 |
| Necator americanus mitochondrion | 13,605 |
| Chaetoceros tenuissimus DNA virus | 5,639 |
| Chaetoceros lorenzianus DNA virus | 5,813 |
| Human papillomavirus type 132 | 7,125 |
| Human papillomavirus type 134 | 7,309 |

Biodoop Workflow

  1. Upload FASTA file into HDFS
  2. Convert FASTA to TAB format using fasta2tab
  3. Execute BLAST MapReduce job using biodoop_blast
  4. Analyze alignment results
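Step 2 exists because MapReduce splits its input by line, so each multi-line FASTA record has to become a single line. The sketch below illustrates that transformation in plain Python, assuming an "identifier, tab, sequence" output layout. It is not Biodoop's fasta2tab implementation, and the function name `fasta_to_tab` and the sample records are hypothetical.

```python
def fasta_to_tab(lines):
    """Yield one 'header<TAB>sequence' line per FASTA record."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header + "\t" + "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may wrap over several lines; collect the pieces.
            chunks.append(line)
    if header is not None:
        yield header + "\t" + "".join(chunks)


if __name__ == "__main__":
    # Hypothetical two-record input, not taken from the datasets above.
    sample = [">seq1 demo", "ACGT", "TTGA", "", ">seq2", "GG"]
    for record in fasta_to_tab(sample):
        print(record)
```

With one record per line, HDFS blocks and MapReduce input splits can be handed to different mappers without a sequence being cut in the middle of a wrapped line.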

---

4. Hadoop Configuration

Heap Size Experiment

| Heap Size (MB) | Result |
|---|---|
| 64 | Failed |
| 128 | Failed |
| 192 | Failed |
| 256 | Failed |
| 320 | Failed |
| 384 | Success |
| 448 | Failed |
| 512 | Failed |

Working value (the only setting that succeeded):

    HADOOP_HEAPSIZE = 384 MB

Additional Configuration

    dfs.client.file-block-storage-locations.timeout = 1200
    dfs.namenode.fs-limits.min-block-size = 512
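Both settings are HDFS properties. The following is a minimal sketch of how the name/value pairs map onto Hadoop's XML configuration format, assuming they are placed in hdfs-site.xml (the source does not name the file). `render_hadoop_config` is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Property names and values exactly as listed above.
HDFS_PROPERTIES = {
    "dfs.client.file-block-storage-locations.timeout": "1200",
    "dfs.namenode.fs-limits.min-block-size": "512",
}


def render_hadoop_config(properties):
    """Return Hadoop-style <configuration> XML for the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):
        ET.indent(root)  # pretty-printing is only available on Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_hadoop_config(HDFS_PROPERTIES))
```

Lowering dfs.namenode.fs-limits.min-block-size matters here because its default of 1 MiB (1048576 bytes) would otherwise cause the NameNode to reject kilobyte-scale block sizes.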

Custom block sizes used: 3 KB, 5 KB, and 10 KB.
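To see why such small block sizes are needed, it helps to estimate how many HDFS blocks each sequence occupies. The sketch below does this under three labeled assumptions: "KB" means 1024 bytes, each base takes one byte, and headers and line breaks are ignored. The pairing of sequences with block sizes follows the test results reported later (10 KB for Test 1, 5 KB for Test 2, 3 KB for Test 3). The helper name `block_count` is hypothetical.

```python
import math

KIB = 1024  # assumption: "KB" in the text means 1024 bytes

# (sequence length in bp, custom block size in bytes) per sequence.
# Lengths come from the DNA Datasets table; block sizes from the tests.
SEQUENCES = {
    "Ancylostoma duodenale mitochondrion": (13_271, 10 * KIB),
    "Necator americanus mitochondrion": (13_605, 10 * KIB),
    "Human papillomavirus type 132": (7_125, 5 * KIB),
    "Human papillomavirus type 134": (7_309, 5 * KIB),
    "Chaetoceros tenuissimus DNA virus": (5_639, 3 * KIB),
    "Chaetoceros lorenzianus DNA virus": (5_813, 3 * KIB),
}


def block_count(file_bytes, block_bytes):
    """Number of HDFS blocks needed to hold a file of the given size."""
    return math.ceil(file_bytes / block_bytes)


if __name__ == "__main__":
    for name, (length_bp, block_bytes) in SEQUENCES.items():
        # Assumption: roughly one byte per base, headers ignored.
        print(name, block_count(length_bp, block_bytes))
```

Under these assumptions every sequence spans two blocks at its custom block size, whereas Hadoop's default block size (64 MB or 128 MB, depending on version) would keep each file in a single block.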

---

5. DNA Sequence Alignment Results

Test 1

Reference: Ancylostoma duodenale mitochondrion

Query: Necator americanus mitochondrion

Results:

  • Similarity > 89%
  • Bit Score = 1225

Raspberry Pi Performance

| Metric | Default Block | 10 KB Block |
|---|---|---|
| CPU (%) | 82.69 | 62.50 |
| Memory (MB) | 344.76 | 335.91 |
| Time (min) | 17.66 | 25.86 |

---

Test 2

Reference: Human papillomavirus type 132

Query: Human papillomavirus type 134

Results:

  • Similarity > 93%
  • Bit Score = 43.5

Raspberry Pi Performance

| Metric | Default Block | 5 KB Block |
|---|---|---|
| CPU (%) | 92.14 | 67.01 |
| Memory (MB) | 355.49 | 312.36 |
| Time (min) | 15.45 | 19.84 |

---

Test 3

Reference: Chaetoceros tenuissimus DNA virus

Query: Chaetoceros lorenzianus DNA virus

Results:

  • Similarity > 83%
  • Bit Score = 63

Raspberry Pi Performance

| Metric | Default Block | 3 KB Block |
|---|---|---|
| CPU (%) | 71.24 | 62.88 |
| Memory (MB) | 348.62 | 311.02 |
| Time (min) | 15.07 | 25.44 |

---

6. Analysis

Effects of Hadoop block distribution:

Advantages

  • CPU usage reduced by 24.14%
  • Memory usage reduced by 8.49%
  • Better workload distribution across DataNodes

Disadvantages

  • Processing time increased by 31.53%
  • Additional MapReduce overhead
  • Additional block-splitting overhead

For small DNA files, Hadoop overhead becomes significant relative to computation time.
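The summary figures can be checked against the three result tables. The sketch below is one possible derivation, not necessarily the authors' method. The helper names are hypothetical and the averaging choices are assumptions. With the tabulated values, the mean over all six runs gives the reported 73.08% CPU, 334.69 MB memory, and 19.89 minutes. The mean per-test memory change relative to the default block gives the reported 8.49% reduction. The 31.53% time figure is matched when each time difference is divided by the custom-block time rather than the default-block time. This sketch does not attempt to reproduce the 24.14% CPU figure, whose averaging method is not stated here.

```python
from statistics import mean

# (default block, custom block) per test, copied from the result tables.
CPU_PERCENT = [(82.69, 62.50), (92.14, 67.01), (71.24, 62.88)]
MEMORY_MB = [(344.76, 335.91), (355.49, 312.36), (348.62, 311.02)]
TIME_MIN = [(17.66, 25.86), (15.45, 19.84), (15.07, 25.44)]


def overall_mean(pairs):
    """Mean over all six runs (three tests, two block settings each)."""
    return mean(value for pair in pairs for value in pair)


def mean_relative_change(pairs, baseline="default"):
    """Mean per-test change in percent, relative to the chosen baseline.

    Negative values mean the custom block setting lowered the metric.
    """
    changes = []
    for default, custom in pairs:
        base = default if baseline == "default" else custom
        changes.append((custom - default) / base * 100)
    return mean(changes)


if __name__ == "__main__":
    print("mean CPU (%):", round(overall_mean(CPU_PERCENT), 2))
    print("mean memory (MB):", round(overall_mean(MEMORY_MB), 2))
    print("mean time (min):", round(overall_mean(TIME_MIN), 2))
    print("memory change (%):", round(mean_relative_change(MEMORY_MB), 2))
    print("time change vs custom (%):",
          round(mean_relative_change(TIME_MIN, baseline="custom"), 2))
```

The choice of baseline matters for the time figure: measured against the shorter default-block times, the same differences produce a larger percentage than 31.53%.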

---

7. Conclusion

Raspberry Pi can successfully be used as a low-cost Hadoop cluster platform for bioinformatics workloads.

Main findings:

  • Hadoop runs reliably on Raspberry Pi.
  • DNA sequence alignment using Biodoop was successfully executed.
  • Average CPU utilization was 73.08%.
  • Average memory usage was 334.69 MB.
  • Average completion time was 19.89 minutes.
  • Block distribution reduced CPU and memory usage.
  • Processing time increased because of MapReduce overhead.

---

Future Work

Future research may include:

  1. Using a Hadoop native library built for ARM
  2. Newer Raspberry Pi hardware
  3. Larger DNA datasets
  4. Spark-based implementation
  5. Energy efficiency analysis

---

Citation

Turana, J. S., Sukoco, H., & Kusuma, W. A. (2016). Hadoop Performance Analysis on Raspberry Pi for DNA Sequence Alignment. TELKOMNIKA, 14(3), 1059–1066.

Keywords: Hadoop, big data, DNA sequence alignment