Collect a dataset of emails labeled as spam (1) or ham (0). Popular datasets: Enron Spam Dataset, SpamAssassin Public Corpus.
Example:
| Email | Label |
|-------------------------|-------|
| "Get a 50% discount" | 1 |
| "Good news for you" | 0 |
The steps in this process are as follows:
Example: "I won a prize!" → ["i", "won", "a", "prize"]
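The tokenization example can be sketched in a few lines of Python. This is a minimal illustration that assumes lowercasing and dropping punctuation are the only preprocessing steps. The function name `tokenize` and the regular expression are choices made for this sketch, not something the text prescribes.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and return its alphanumeric word tokens."""
    # Assumption: a token is any run of ASCII letters or digits;
    # punctuation and whitespace act as separators.
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("I won a prize!"))  # ['i', 'won', 'a', 'prize']
```

A real pipeline would often also remove stop words or apply stemming. Those steps are left out here to keep the sketch small.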
Bag-of-Words (BoW) and TF-IDF are the techniques used in this step, though other techniques also exist:
Bag-of-Words:
| Word | Frequency |
|------------|-----------|
| won | 2 |
| discount | 1 |
| prize | 1 |
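A frequency table of this kind could be produced roughly as follows. The sketch assumes a fixed vocabulary of three words (won, discount, prize). The sample message is hypothetical, picked so that the counts come out as 2, 1, and 1, and `bow_vector` is an illustrative name.

```python
import re
from collections import Counter
from typing import List


def bow_vector(text: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary word occurs in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[word] for word in vocabulary]


vocabulary = ["won", "discount", "prize"]
# Hypothetical message used only for illustration.
message = "You won! Claim the prize you won with a discount."
print(bow_vector(message, vocabulary))  # [2, 1, 1]
```

Each email becomes a vector of the same length as the vocabulary, which is the form most classifiers expect as input.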
TF-IDF: Example TF-IDF calculation:
TF = (Frequency of the word in the document) / (Total words in the document)
IDF = log(Total documents / Number of documents containing the word)
TF-IDF = TF * IDF
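The three formulas translate almost directly into code. The sketch below assumes the natural logarithm (the text does not state a base) and assumes every queried term appears in at least one document, since the IDF is otherwise undefined. The tiny corpus is made up for illustration.

```python
import math
from typing import List


def tf(term: str, doc: List[str]) -> float:
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)


def idf(term: str, corpus: List[List[str]]) -> float:
    """Inverse document frequency, using the natural log (an assumption)."""
    containing = sum(1 for doc in corpus if term in doc)
    # Assumption: the term occurs in at least one document (containing > 0).
    return math.log(len(corpus) / containing)


def tf_idf(term: str, doc: List[str], corpus: List[List[str]]) -> float:
    return tf(term, doc) * idf(term, corpus)


# Hypothetical, already tokenized corpus of four emails.
corpus = [
    ["won", "a", "prize"],
    ["good", "news", "for", "you"],
    ["get", "a", "discount"],
    ["you", "won"],
]
score = tf_idf("prize", corpus[0], corpus)  # (1/3) * log(4/1)
print(round(score, 3))
```

A word that appears in every document gets an IDF of log(1) = 0, so it contributes nothing to the score. That is the intended effect for very common words.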
Choose a model such as:
Train the model on the training data using a relevant technique.
Example: Using the Naïve Bayes algorithm:
P(Spam|Word) = (P(Word|Spam) * P(Spam)) / P(Word)
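To make the formula concrete, here is a small sketch that evaluates it for a single word. It assumes P(Word) is expanded with the law of total probability over the two classes, spam and ham. The probability values are hypothetical, not estimates from a real dataset, and `p_spam_given_word` is an illustrative name.

```python
def p_spam_given_word(p_word_given_spam: float,
                      p_word_given_ham: float,
                      p_spam: float) -> float:
    """Bayes' rule: P(Spam|Word) = P(Word|Spam) * P(Spam) / P(Word)."""
    p_ham = 1.0 - p_spam
    # P(Word) = P(Word|Spam) * P(Spam) + P(Word|Ham) * P(Ham)
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return (p_word_given_spam * p_spam) / p_word


# Hypothetical values: the word shows up in 60% of spam, 5% of ham,
# and 40% of all emails are spam.
posterior = p_spam_given_word(0.6, 0.05, 0.4)
print(round(posterior, 3))  # 0.889
```

A full Naïve Bayes classifier combines such per-word probabilities across all words in an email, under the assumption that words are independent given the class. It usually adds smoothing so that unseen words do not produce zero probabilities.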
Metrics used for evaluation:
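Common choices for a binary spam classifier are accuracy, precision, recall, and F1. Taking those four as an assumption, the sketch below computes them from true and predicted labels. The label lists are hypothetical and `evaluate` is an illustrative name.

```python
from typing import Dict, List


def evaluate(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 with spam (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Guard against empty denominators by falling back to 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

For spam filtering, precision often matters most in practice, because a legitimate email wrongly marked as spam is usually more costly than a spam message that slips through.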
Integrate the model into the email system to detect spam in real time.
Example: The email system automatically marks incoming emails as spam when the model predicts spam.
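The integration step might look like the following sketch. It assumes the trained model is exposed as a function that maps the email text to 1 (spam) or 0 (ham). The names `route_email` and `keyword_model` are hypothetical, and the keyword rule is only a stand-in for a real trained classifier.

```python
from typing import Callable


def route_email(text: str, predict: Callable[[str], int]) -> str:
    """Return the target folder: 'spam' if the model predicts 1, else 'inbox'."""
    return "spam" if predict(text) == 1 else "inbox"


def keyword_model(text: str) -> int:
    """Hypothetical stand-in for a trained model: flags any mention of 'discount'."""
    return 1 if "discount" in text.lower() else 0


print(route_email("Get a 50% discount", keyword_model))  # spam
print(route_email("Good news for you", keyword_model))   # inbox
```

Passing the predictor in as a parameter keeps the routing logic independent of the model, so the classifier can be retrained or replaced without touching the mail-handling code.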