Hadoop Performance Analysis on Raspberry Pi for DNA Sequence Alignment
Authors: Jaya Sena Turana, Heru Sukoco, Wisnu Ananta Kusuma
Journal: TELKOMNIKA, Vol. 14 No. 3 (2016)
DOI: 10.12928/TELKOMNIKA.v14i3.1886
---
Abstract
The rapid growth of electronic data has created challenges in storing and processing big data. Hadoop provides a distributed computing framework that can run on commodity hardware. This research analyzes the performance of a Hadoop cluster built from six Raspberry Pi Model B devices when running DNA sequence alignment with the Biodoop library.
Key results:
- Average CPU usage: 73.08%
- Average memory usage: 334.69 MB
- Average job completion time: 19.89 minutes
- Block distribution reduced CPU usage by 24.14%
- Block distribution reduced memory usage by 8.49%
- Block distribution increased processing time by 31.53%
---
1. Introduction
Big Data is characterized by:
- Volume
- Variety
- Velocity
Hadoop provides:
- HDFS (Hadoop Distributed File System)
- MapReduce processing framework
Raspberry Pi was selected because:
- Low cost
- Low power consumption
- Easy cluster deployment
- Suitable for educational and research environments
The objective of this research is to analyze Hadoop cluster performance on Raspberry Pi for DNA sequence alignment.
---
2. Related Works
Several Raspberry Pi clusters have previously been developed:
| Cluster | Nodes |
|---|---|
| Iridis-Pi | 64 |
| Glasgow Raspberry Pi Cloud | 54 |
| Bolzano Raspberry Pi Cloud | 300 |
| HPC Cluster | 14 |
These studies demonstrated the feasibility of low-cost distributed computing.
---
3. Research Method
Hardware
- 6 Raspberry Pi Model B
- 1 NameNode
- 5 DataNodes
- Raspbian OS
Software
- Apache Hadoop
- HDFS
- MapReduce
- Biodoop
- Ganglia Monitoring
DNA Datasets
| DNA Sequence | Length (bp) |
|---|---|
| Ancylostoma duodenale mitochondrion | 13,271 |
| Necator americanus mitochondrion | 13,605 |
| Chaetoceros tenuissimus DNA virus | 5,639 |
| Chaetoceros lorenzianus DNA virus | 5,813 |
| Human papillomavirus type 132 | 7,125 |
| Human papillomavirus type 134 | 7,309 |
Biodoop Workflow
- Upload FASTA file into HDFS
- Convert FASTA to TAB format using fasta2tab
- Execute BLAST MapReduce job using biodoop_blast
- Analyze alignment results
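The FASTA-to-TAB step is worth illustrating, because a format with one record per line is easier to split across map tasks than multi-line FASTA. The following is a minimal sketch of such a conversion and is not Biodoop's fasta2tab implementation. The function name `fasta_to_tab` and the exact output layout (header, tab, sequence) are assumptions made for illustration.

```python
def fasta_to_tab(fasta_text):
    """Return one 'header<TAB>sequence' line per FASTA record (illustrative layout)."""
    records = []
    header = None
    chunks = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated into a single string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return "\n".join(f"{h}\t{s}" for h, s in records)


# Hypothetical two-record input, not one of the datasets above.
example = ">seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n"
tab_text = fasta_to_tab(example)
print(tab_text)
```

Each output line carries a complete record, so a line-oriented input reader can hand whole sequences to a mapper without reassembling them.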
---
4. Hadoop Configuration
Heap Size Experiment
| Heap Size (MB) | Result |
|---|---|
| 64 | Failed |
| 128 | Failed |
| 192 | Failed |
| 256 | Failed |
| 320 | Failed |
| 384 | Success |
| 448 | Failed |
| 512 | Failed |
Working value (the only tested heap size at which the job succeeded):
HADOOP_HEAPSIZE = 384 MB
Additional Configuration
dfs.client.file-block-storage-locations.timeout = 1200
dfs.namenode.fs-limits.min-block-size = 512
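The two properties above are HDFS settings, which are conventionally placed in hdfs-site.xml. The summary does not say which file the authors edited, so the file name is an assumption here. As a hedged sketch, the snippet below renders the two name/value pairs in Hadoop's usual `<configuration>/<property>` layout. The helper name `build_hdfs_site` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_hdfs_site(properties):
    """Render a dict of Hadoop properties as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Values taken from the configuration listed above.
settings = {
    "dfs.client.file-block-storage-locations.timeout": 1200,
    "dfs.namenode.fs-limits.min-block-size": 512,
}
hdfs_site_xml = build_hdfs_site(settings)
print(hdfs_site_xml)
```

Lowering dfs.namenode.fs-limits.min-block-size (given in bytes) is what permits the kilobyte-scale block sizes used in the experiments, since the NameNode rejects block sizes below that limit.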
Custom block sizes used:
3 KB
5 KB
10 KB
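To see why such small block sizes were needed, it helps to estimate how many HDFS blocks each sequence file would occupy. The sketch below assumes roughly one byte per base pair and ignores FASTA headers and line breaks, so the real file sizes will be slightly larger. The pairing of sequences with block sizes follows the three tests in Section 5, and the 512-byte alignment check reflects Hadoop's default checksum chunk size, which is also an assumption about the cluster's settings.

```python
KB = 1024
MIN_BLOCK_BYTES = 512      # dfs.namenode.fs-limits.min-block-size from the configuration
CHECKSUM_CHUNK_BYTES = 512  # assumed Hadoop default bytes-per-checksum


def block_count(file_bytes, block_bytes):
    """Number of HDFS blocks needed for a file (ceiling division)."""
    if block_bytes < MIN_BLOCK_BYTES:
        raise ValueError("block size is below the configured minimum")
    if block_bytes % CHECKSUM_CHUNK_BYTES != 0:
        raise ValueError("block size must be a multiple of the checksum chunk size")
    return (file_bytes + block_bytes - 1) // block_bytes


# (sequence name, approximate size in bytes = length in bp, block size in bytes)
cases = [
    ("Ancylostoma duodenale mitochondrion", 13271, 10 * KB),
    ("Necator americanus mitochondrion", 13605, 10 * KB),
    ("Human papillomavirus type 132", 7125, 5 * KB),
    ("Human papillomavirus type 134", 7309, 5 * KB),
    ("Chaetoceros tenuissimus DNA virus", 5639, 3 * KB),
    ("Chaetoceros lorenzianus DNA virus", 5813, 3 * KB),
]

counts = {name: block_count(size, block) for name, size, block in cases}
for name, n in counts.items():
    print(f"{name}: {n} blocks")
```

Under these assumptions every file spans more than one block, whereas with a default block size measured in megabytes each file would fit in a single block and could not be distributed across DataNodes.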
---
5. DNA Sequence Alignment Results
Test 1
Reference:
Ancylostoma duodenale mitochondrion
Query:
Necator americanus mitochondrion
Results:
- Similarity > 89%
- Bit Score = 1225
Raspberry Pi Performance
| Metric | Default Block | 10 KB Block |
|---|---|---|
| CPU (%) | 82.69 | 62.50 |
| Memory (MB) | 344.76 | 335.91 |
| Time (min) | 17.66 | 25.86 |
---
Test 2
Reference:
Human papillomavirus type 132
Query:
Human papillomavirus type 134
Results:
- Similarity > 93%
- Bit Score = 43.5
Raspberry Pi Performance
| Metric | Default Block | 5 KB Block |
|---|---|---|
| CPU (%) | 92.14 | 67.01 |
| Memory (MB) | 355.49 | 312.36 |
| Time (min) | 15.45 | 19.84 |
---
Test 3
Reference:
Chaetoceros tenuissimus DNA virus
Query:
Chaetoceros lorenzianus DNA virus
Results:
- Similarity > 83%
- Bit Score = 63
Raspberry Pi Performance
| Metric | Default Block | 3 KB Block |
|---|---|---|
| CPU (%) | 71.24 | 62.88 |
| Memory (MB) | 348.62 | 311.02 |
| Time (min) | 15.07 | 25.44 |
---
6. Analysis
Effects of Hadoop block distribution:
Advantages
- CPU usage reduced by 24.14%
- Memory usage reduced by 8.49%
- Better workload distribution across DataNodes
Disadvantages
- Processing time increased by 31.53%
- Additional MapReduce overhead
- Additional block-splitting overhead
For small DNA files, Hadoop overhead becomes significant relative to computation time.
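The overall averages quoted in the abstract can be recomputed from the three result tables in Section 5. The sketch below pools the six runs (three with the default block size, three with a custom block size) and takes a plain arithmetic mean per metric. The dictionary layout and key names are illustrative choices, not something taken from the paper.

```python
from statistics import mean

# Values copied from the Test 1, Test 2, and Test 3 tables, in that order.
results = {
    "cpu_pct": {"default": [82.69, 92.14, 71.24], "custom": [62.50, 67.01, 62.88]},
    "memory_mb": {"default": [344.76, 355.49, 348.62], "custom": [335.91, 312.36, 311.02]},
    "time_min": {"default": [17.66, 15.45, 15.07], "custom": [25.86, 19.84, 25.44]},
}

# Mean over all six runs for each metric.
overall = {
    metric: round(mean(runs["default"] + runs["custom"]), 2)
    for metric, runs in results.items()
}

# Mean per block-size configuration, to show the direction of each change.
per_config = {
    metric: {config: round(mean(values), 2) for config, values in runs.items()}
    for metric, runs in results.items()
}

print(overall)
print(per_config)
```

The pooled means come out to 73.08% CPU, 334.69 MB of memory, and 19.89 minutes, matching the reported averages. The per-configuration means show the trade-off described above: lower CPU and memory usage with custom blocks, at the cost of longer completion time.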
---
7. Conclusion
Raspberry Pi can successfully be used as a low-cost Hadoop cluster platform for bioinformatics workloads.
Main findings:
- Hadoop runs on Raspberry Pi Model B when HADOOP_HEAPSIZE is set to 384 MB.
- DNA sequence alignment using Biodoop was successfully executed.
- Average CPU utilization reached 73.08%.
- Average memory utilization reached 334.69 MB.
- Average completion time was 19.89 minutes.
- Block distribution reduced CPU usage by 24.14% and memory usage by 8.49%.
- Processing time increased by 31.53% because of MapReduce overhead.
---
Future Work
Future research may include:
- Hadoop Native ARM Library
- Newer Raspberry Pi hardware
- Larger DNA datasets
- Spark-based implementation
- Energy efficiency analysis
---
Citation
Turana, J. S., Sukoco, H., & Kusuma, W. A. (2016). Hadoop Performance Analysis on Raspberry Pi for DNA Sequence Alignment. TELKOMNIKA, 14(3), 1059–1066.