Friday, February 2, 2024

NLP Learning Notes: Installing the Hugging Face NLP Toolkit Packages

The Transformer is currently the most widely used and most advanced architecture in natural language processing. It can perform tasks such as text classification, generation, summarization, and question answering (Q&A). Hugging Face provides an open-source NLP toolkit whose unified interface makes it easy to build NLP applications. Its main packages are:
  • transformers: an implementation of the Transformer architecture
  • datasets: a unified toolkit for dataset handling
Reference books:
Tutorial documentation:



1. Installing the transformers package:

This afternoon I installed Hugging Face's transformers package in Thonny (which bundles Python 3.10):

D:\python>pip install transformers    
Collecting transformers
  Downloading transformers-4.37.2-py3-none-any.whl.metadata (129 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.4/129.4 kB 692.6 kB/s eta 0:00:00
Requirement already satisfied: filelock in c:\users\tony1\appdata\roaming\python\python310\site-packages (from transformers) (3.12.3)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from transformers) (0.20.1)
Requirement already satisfied: numpy>=1.17 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from transformers) (1.24.3)
Requirement already satisfied: packaging>=20.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from transformers) (23.1)
Requirement already satisfied: pyyaml>=5.1 in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from transformers) (6.0.1)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2023.12.25-cp310-cp310-win_amd64.whl.metadata (41 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.0/42.0 kB 2.0 MB/s eta 0:00:00
Requirement already satisfied: requests in c:\users\tony1\appdata\roaming\python\python310\site-packages (from transformers) (2.31.0)
Collecting tokenizers<0.19,>=0.14 (from transformers)
  Downloading tokenizers-0.15.1-cp310-none-win_amd64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.1 (from transformers)
  Downloading safetensors-0.4.2-cp310-none-win_amd64.whl.metadata (3.9 kB)
Requirement already satisfied: tqdm>=4.27 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from transformers) (4.66.1)
Requirement already satisfied: fsspec>=2023.5.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from huggingface-hub<1.0,>=0.19.3->transformers) (2023.9.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from huggingface-hub<1.0,>=0.19.3->transformers) (4.9.0)
Requirement already satisfied: colorama in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from tqdm>=4.27->transformers) (0.4.6)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests->transformers) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests->transformers) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests->transformers) (2.1.0)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests->transformers) (2023.7.22)
Downloading transformers-4.37.2-py3-none-any.whl (8.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 2.6 MB/s eta 0:00:00
Downloading regex-2023.12.25-cp310-cp310-win_amd64.whl (269 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 269.5/269.5 kB 5.5 MB/s eta 0:00:00
Downloading safetensors-0.4.2-cp310-none-win_amd64.whl (269 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 269.5/269.5 kB 3.3 MB/s eta 0:00:00
Downloading tokenizers-0.15.1-cp310-none-win_amd64.whl (2.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 3.4 MB/s eta 0:00:00
Installing collected packages: safetensors, regex, tokenizers, transformers
Successfully installed regex-2023.12.25 safetensors-0.4.2 tokenizers-0.15.1 transformers-4.37.2

The transformers package turns out not to be very large. After installation, import the module and check its version:

>>> import transformers  
>>> transformers.__version__   
'4.37.2'   
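Besides reading `__version__`, the installed version can also be queried through the standard library's `importlib.metadata`. A small sketch (the helper name `pkg_version` is my own, not part of any library):

```python
from importlib import metadata

def pkg_version(name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

print(pkg_version("transformers"))  # '4.37.2' on this machine; None if not installed
```

This has the advantage of not importing the (large) package just to read its version.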

Then inspect its contents with dir():

>>> dir(transformers)   
['ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP', 'ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST', 'ALIGN_PRETRAINED_CONFIG_ARCHIVE_MAP', 'ALIGN_PRETRAINED_MODEL_ARCHIVE_LIST', 'ALL_PRETRAINED_CONFIG_ARCHIVE_MAP', 'ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP', 'ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST', 'ASTConfig', 'ASTFeatureExtractor', 'ASTForAudioClassification', 'ASTModel', 'ASTPreTrainedModel', 'AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP', 'AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST', 'AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP', 'AUTOFORMER_PRETRAINED_MODEL_ARCHIVE_LIST', 'Adafactor', 'AdamW', 'AdamWeightDecay', 'AdaptiveEmbedding', 'AddedToken', 'Agent', 'AlbertConfig', 'AlbertForMaskedLM', 'AlbertForMultipleChoice', 'AlbertForPreTraining', 'AlbertForQuestionAnswering', 'AlbertForSequenceClassification', 'AlbertForTokenClassification', 'AlbertModel', 'AlbertPreTrainedModel', 'AlbertTokenizer', 'AlbertTokenizerFast', 'AlignConfig', 'AlignModel', 'AlignPreTrainedModel', 'AlignProcessor', 'AlignTextConfig', 'AlignTextModel', 'AlignVisionConfig', 'AlignVisionModel', 'AltCLIPConfig', 'AltCLIPModel', 'AltCLIPPreTrainedModel', 'AltCLIPProcessor', 'AltCLIPTextConfig', 'AltCLIPTextModel', 'AltCLIPVisionConfig', 'AltCLIPVisionModel', 'AlternatingCodebooksLogitsProcessor', 'AudioClassificationPipeline', 'AutoBackbone', 'AutoConfig', 'AutoFeatureExtractor', 'AutoImageProcessor', 'AutoModel', 'AutoModelForAudioClassification', 'AutoModelForAudioFrameClassification', 'AutoModelForAudioXVector', 'AutoModelForCTC', 'AutoModelForCausalLM', 'AutoModelForDepthEstimation', 'AutoModelForDocumentQuestionAnswering', 'AutoModelForImageClassification', 'AutoModelForImageSegmentation', 'AutoModelForImageToImage', 'AutoModelForInstanceSegmentation', 'AutoModelForMaskGeneration', 'AutoModelForMaskedImageModeling', 'AutoModelForMaskedLM', 'AutoModelForMultipleChoice', 'AutoModelForNextSentencePrediction', 'AutoModelForObjectDetection', 'AutoModelForPreTraining', 
'AutoModelForQuestionAnswering', 'AutoModelForSemanticSegmentation', 'AutoModelForSeq2SeqLM', 'AutoModelForSequenceClassification', 'AutoModelForSpeechSeq2Seq', 'AutoModelForTableQuestionAnswering', 'AutoModelForTextEncoding', 'AutoModelForTextToSpectrogram', 'AutoModelForTextToWaveform', 'AutoModelForTokenClassification', 'AutoModelForUniversalSegmentation', 'AutoModelForVideoClassification', 'AutoModelForVision2Seq', 'AutoModelForVisualQuestionAnswering', 'AutoModelForZeroShotImageClassification', 'AutoModelForZeroShotObjectDetection', 'AutoModelWithLMHead', 'AutoProcessor', 'AutoTokenizer', 'AutoformerConfig', 'AutoformerForPrediction', 'AutoformerModel', 'AutoformerPreTrainedModel', 'AutomaticSpeechRecognitionPipeline', 'AwqConfig', 'AzureOpenAiAgent', 'BARK_PRETRAINED_MODEL_ARCHIVE_LIST', 'BART_PRETRAINED_MODEL_ARCHIVE_LIST', 'BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BEIT_PRETRAINED_MODEL_ARCHIVE_LIST', 'BERT_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BERT_PRETRAINED_MODEL_ARCHIVE_LIST', 'BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BIGBIRD_PEGASUS_PRETRAINED_MODEL_ARCHIVE_LIST', 'BIG_BIRD_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BIG_BIRD_PRETRAINED_MODEL_ARCHIVE_LIST', 'BIOGPT_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BIOGPT_PRETRAINED_MODEL_ARCHIVE_LIST', 'BIT_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BIT_PRETRAINED_MODEL_ARCHIVE_LIST', 'BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BLENDERBOT_PRETRAINED_MODEL_ARCHIVE_LIST', 'BLENDERBOT_SMALL_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BLENDERBOT_SMALL_PRETRAINED_MODEL_ARCHIVE_LIST', 'BLIP_2_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BLIP_2_PRETRAINED_MODEL_ARCHIVE_LIST', 'BLIP_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BLIP_PRETRAINED_MODEL_ARCHIVE_LIST', 'BLOOM_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BLOOM_PRETRAINED_MODEL_ARCHIVE_LIST', 'BRIDGETOWER_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BRIDGETOWER_PRETRAINED_MODEL_ARCHIVE_LIST', 'BROS_PRETRAINED_CONFIG_ARCHIVE_MAP', 'BROS_PRETRAINED_MODEL_ARCHIVE_LIST', 'BarkCausalModel', 'BarkCoarseConfig', 'BarkCoarseModel', 'BarkConfig', 
'BarkFineConfig', 'BarkFineModel', 'BarkModel', 'BarkPreTrainedModel', 'BarkProcessor', 'BarkSemanticConfig', 'BarkSemanticModel', 'BartConfig', 'BartForCausalLM', 'BartForConditionalGeneration', 'BartForQuestionAnswering', 'BartForSequenceClassification', 'BartModel', 'BartPreTrainedModel', 'BartPretrainedModel', 'BartTokenizer', 'BartTokenizerFast', 'BarthezTokenizer', 'BarthezTokenizerFast', 'BartphoTokenizer', 'BasicTokenizer', 'BatchEncoding', 'BatchFeature', 'BeamScorer', 'BeamSearchScorer', 'BeitBackbone', 'BeitConfig', 'BeitFeatureExtractor', 'BeitForImageClassification', 'BeitForMaskedImageModeling', 'BeitForSemanticSegmentation', 'BeitImageProcessor', 'BeitModel', 'BeitPreTrainedModel', 'BertConfig', 'BertForMaskedLM', 'BertForMultipleChoice', 'BertForNextSentencePrediction', 'BertForPreTraining', 'BertForQuestionAnswering', 'BertForSequenceClassification', 'BertForTokenClassification', 'BertGenerationConfig', 'BertGenerationDecoder', 'BertGenerationEncoder', 'BertGenerationPreTrainedModel', 'BertGenera…

The trailing ... shows just how many members this package exposes!
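A namespace this large is easier to browse with a small filter over dir(). A sketch using the standard library's fnmatch (`grep_members` is a hypothetical helper; it is demonstrated on the json module so it runs anywhere, but any module such as transformers works the same way):

```python
import fnmatch
import json  # stand-in module for the demo; transformers works the same way

def grep_members(obj, pattern):
    """Return the members of obj whose names match a shell-style pattern."""
    return [name for name in dir(obj) if fnmatch.fnmatch(name, pattern)]

print(grep_members(json, "*load*"))  # ['load', 'loads']
# e.g. grep_members(transformers, "AutoModelFor*") would list the AutoModel* task classes
```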

The Google Colab runtime comes with the transformers package preinstalled, so it can be imported directly without installation:

[screenshot: importing transformers in Colab and checking its version]

As the screenshot shows, its version (v4.35.2) is somewhat older than the v4.37.2 installed locally above.
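Note that version strings compare incorrectly as plain strings (lexicographically, "4.9.0" > "4.35.2"), so a quick way to confirm that Colab's copy is older is to compare numeric tuples. A minimal sketch (`version_tuple` is my own helper; the full PEP 440 rules handled by the packaging library are more involved):

```python
def version_tuple(v):
    """Parse a simple 'major.minor.patch' string into a comparable tuple of ints."""
    return tuple(int(part) for part in v.split("."))

print(version_tuple("4.35.2") < version_tuple("4.37.2"))  # True: Colab's copy is older
```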


2. Installing the datasets package:

This package handles data preprocessing: loading and saving datasets, as well as inspecting, sorting, filtering, sampling, and splitting them. Hugging Face's datasets package provides a unified dataset-handling API that greatly reduces the difficulty and workload of preprocessing (in data science, preprocessing accounts for most of the work and is often regarded as dirty work).

On the local machine this package can be installed directly with pip install:

D:\python>pip install datasets    
Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl.metadata (20 kB)
Requirement already satisfied: filelock in c:\users\tony1\appdata\roaming\python\python310\site-packages (from datasets) (3.12.3)
Requirement already satisfied: numpy>=1.17 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from datasets) (1.24.3)
Requirement already satisfied: pyarrow>=8.0.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from datasets) (13.0.0)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Requirement already satisfied: dill<0.3.8,>=0.3.0 in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from datasets) (0.3.7)
Requirement already satisfied: pandas in c:\users\tony1\appdata\roaming\python\python310\site-packages (from datasets) (2.0.3)
Requirement already satisfied: requests>=2.19.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from datasets) (2.31.0)
Requirement already satisfied: tqdm>=4.62.1 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from datasets) (4.66.1)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-win_amd64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Requirement already satisfied: fsspec<=2023.10.0,>=2023.1.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from fsspec[http]<=2023.10.0,>=2023.1.0->datasets) (2023.9.1)
Requirement already satisfied: aiohttp in c:\users\tony1\appdata\roaming\python\python310\site-packages (from datasets) (3.9.1)
Requirement already satisfied: huggingface-hub>=0.19.4 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from datasets) (0.20.1)
Requirement already satisfied: packaging in c:\users\tony1\appdata\roaming\python\python310\site-packages (from datasets) (23.1)
Requirement already satisfied: pyyaml>=5.1 in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from datasets) (6.0.1)
Requirement already satisfied: attrs>=17.3.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from aiohttp->datasets) (23.1.0)
Requirement already satisfied: multidict<7.0,>=4.5 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from aiohttp->datasets) (6.0.4)
Requirement already satisfied: yarl<2.0,>=1.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from aiohttp->datasets) (1.9.4)
Requirement already satisfied: frozenlist>=1.1.1 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from aiohttp->datasets) (1.4.1)
Requirement already satisfied: aiosignal>=1.1.2 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: async-timeout<5.0,>=4.0 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from aiohttp->datasets) (4.0.3)
Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from huggingface-hub>=0.19.4->datasets) (4.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests>=2.19.0->datasets) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests>=2.19.0->datasets) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests>=2.19.0->datasets) (2.1.0)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from requests>=2.19.0->datasets) (2023.7.22)
Requirement already satisfied: colorama in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from tqdm>=4.62.1->datasets) (0.4.6)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.15-py310-none-any.whl.metadata (7.2 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from pandas->datasets) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in c:\users\tony1\appdata\roaming\python\python310\site-packages (from pandas->datasets) (2023.3)
Requirement already satisfied: six>=1.5 in c:\users\tony1\appdata\local\programs\thonny\lib\site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)
Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 507.1/507.1 kB 2.0 MB/s eta 0:00:00
Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 2.0 MB/s eta 0:00:00
Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Downloading xxhash-3.4.1-cp310-cp310-win_amd64.whl (29 kB)
Installing collected packages: xxhash, pyarrow-hotfix, multiprocess, datasets
Successfully installed datasets-2.16.1 multiprocess-0.70.15 pyarrow-hotfix-0.6 xxhash-3.4.1

Check the version:

>>> import datasets  
>>> datasets.__version__   
'2.16.1'

Inspect its contents with dir():

>>> dir(datasets)     
['Array2D', 'Array3D', 'Array4D', 'Array5D', 'ArrowBasedBuilder', 'Audio', 'AudioClassification', 'AutomaticSpeechRecognition', 'BeamBasedBuilder', 'BuilderConfig', 'ClassLabel', 'Dataset', 'DatasetBuilder', 'DatasetDict', 'DatasetInfo', 'DownloadConfig', 'DownloadManager', 'DownloadMode', 'Features', 'GeneratorBasedBuilder', 'Image', 'ImageClassification', 'IterableDataset', 'IterableDatasetDict', 'LanguageModeling', 'Metric', 'MetricInfo', 'NamedSplit', 'NamedSplitAll', 'QuestionAnsweringExtractive', 'ReadInstruction', 'Sequence', 'Split', 'SplitBase', 'SplitDict', 'SplitGenerator', 'SplitInfo', 'StreamingDownloadManager', 'SubSplitInfo', 'Summarization', 'TaskTemplate', 'TextClassification', 'Translation', 'TranslationVariableLanguages', 'Value', 'VerificationMode', 'Version', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'are_progress_bars_disabled', 'arrow_dataset', 'arrow_reader', 'arrow_writer', 'builder', 'combine', 'concatenate_datasets', 'config', 'data_files', 'dataset_dict', 'deprecation_utils', 'disable_caching', 'disable_progress_bar', 'disable_progress_bars', 'doc_utils', 'download', 'enable_caching', 'enable_progress_bar', 'enable_progress_bars', 'exceptions', 'experimental', 'extract', 'features', 'file_utils', 'filesystems', 'fingerprint', 'formatting', 'get_dataset_config_info', 'get_dataset_config_names', 'get_dataset_default_config_name', 'get_dataset_infos', 'get_dataset_split_names', 'hub', 'info', 'info_utils', 'inspect', 'inspect_dataset', 'inspect_metric', 'interleave_datasets', 'is_caching_enabled', 'is_progress_bar_enabled', 'iterable_dataset', 'keyhash', 'list_datasets', 'list_metrics', 'load', 'load_dataset', 'load_dataset_builder', 'load_from_disk', 'load_metric', 'logging', 'metadata', 'metric', 'naming', 'packaged_modules', 'parallel', 'patching', 'percent', 'py_utils', 'search', 'set_caching_enabled', 'sharding', 'splits', 'stratify', 
'streaming', 'table', 'tasks', 'tf_utils', 'tqdm', 'track', 'typing', 'utils', 'version']

Colab, however, does not preload the datasets package, so it must be installed again after every new connection:

!pip install datasets 




Import it and check the version:

[screenshot: importing datasets in Colab and checking its version]

As the screenshot shows, it is the same latest version as on the local machine, v2.16.1.
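One way to avoid re-typing the pip command on every reconnect is a small guard that installs only when the import would fail. A sketch (`ensure_installed` is a hypothetical helper, not part of any library):

```python
import importlib.util
import subprocess
import sys

def ensure_installed(module_name, pip_name=None):
    """Run pip install only when the module is not already importable."""
    if importlib.util.find_spec(module_name) is None:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )

# In a fresh Colab session this installs datasets once, then becomes a no-op:
# ensure_installed("datasets")
```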
