ezpz.data.llama

ezpz/data/llama.py

LlamaDataLoader

Source code in src/ezpz/data/llama.py
__init__(dataset_repo, tokenizer_name='hf-internal-testing/llama-tokenizer', max_length=512, batch_size=8, shuffle=True, num_workers=2, split='train')

Initializes the LlamaDataLoader.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_repo | str | Hugging Face dataset repository path. | required |
| tokenizer_name | str | Name or path of the LLaMA tokenizer. | 'hf-internal-testing/llama-tokenizer' |
| max_length | int | Maximum sequence length for tokenization. | 512 |
| batch_size | int | Batch size for the DataLoader. | 8 |
| shuffle | bool | Whether to shuffle the dataset. | True |
| num_workers | int | Number of workers for data loading. | 2 |
| split | str | Dataset split to load (e.g., "train", "validation"). | 'train' |
Source code in src/ezpz/data/llama.py
get_data_loader()

Creates and returns a PyTorch DataLoader.
Returns:

| Name | Type | Description |
|---|---|---|
| DataLoader | DataLoader | A PyTorch DataLoader for the tokenized dataset. |
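The `batch_size` and `shuffle` parameters govern how the returned DataLoader groups samples. The following toy generator (pure Python, not the library's or PyTorch's implementation; the `batches` name and `seed` parameter are assumptions for illustration) sketches that contract: batches of `batch_size` items, a possibly smaller final batch, and optional shuffling.

```python
import random

def batches(samples, batch_size, shuffle=True, seed=0):
    """Yield lists of batch_size items (the last may be smaller),
    optionally shuffling the order first, as a DataLoader would."""
    order = list(samples)
    if shuffle:
        random.Random(seed).shuffle(order)
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]

sizes = [len(b) for b in batches(range(20), batch_size=8, shuffle=False)]
print(sizes)  # [8, 8, 4]
```

With `shuffle=True` the per-epoch order changes but every sample still appears exactly once, which is why shuffling is conventionally enabled for the "train" split and disabled for evaluation.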