TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used technique for representing text data as numerical feature vectors. You can use the TfidfVectorizer class from the sklearn.feature_extraction.text module to calculate TF-IDF feature weights for your text data. Here's how you can do it:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus (list of text documents)
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

# Create a TfidfVectorizer instance
vectorizer = TfidfVectorizer()

# Fit and transform the corpus to calculate TF-IDF feature weights
tfidf_matrix = vectorizer.fit_transform(corpus)

# Convert the TF-IDF matrix to a dense array
tfidf_array = tfidf_matrix.toarray()

# Get the feature names (words) corresponding to the columns
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF feature weights for each document
for i, doc in enumerate(corpus):
    print("Document:", doc)
    for feature, weight in zip(feature_names, tfidf_array[i]):
        if weight > 0:
            print(f"  {feature}: {weight:.4f}")
```
In this example, the TfidfVectorizer is used to transform the input text data into a TF-IDF matrix. The fit_transform() method computes the TF-IDF feature weights for each word in the corpus, and toarray() converts the sparse matrix to a dense array for easier manipulation.
Keep in mind that the values above are not raw counts: TfidfVectorizer applies L2 normalization to each document vector by default (norm='l2'), so every row of the matrix has unit length. In practice, you may still want to perform additional preprocessing steps, such as stop-word removal or stemming, on your text data before applying TF-IDF.
Also, remember that the order of terms (features) in the feature_names list corresponds to the columns of the tfidf_array, so you can map the TF-IDF values to specific terms in the documents.
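To make that column-to-term mapping concrete, here is a minimal sketch that loads the matrix into a labeled pandas DataFrame (assuming pandas is installed; corpus, tfidf_array, and feature_names come from the example above):

```python
import pandas as pd

# One row per document, one named column per vocabulary term
df = pd.DataFrame(tfidf_array, columns=feature_names)
print(df.round(4))
```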
What is TF-IDF and how does it work?

TF-IDF scores each term in each document as the product of two factors: term frequency (TF), how often the term occurs in the document, and inverse document frequency (IDF), which down-weights terms that appear in many documents across the corpus. Rare, document-specific terms therefore receive high weights, while ubiquitous terms receive low ones. sklearn computes the whole weight matrix in one call:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents for demonstration
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# View the TF-IDF matrix (a sparse matrix, printed as (row, col) -> value entries)
print(tfidf_matrix)
```
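By default, sklearn uses a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A quick hand check for the term "document" (which appears in 3 of the 4 sample documents) reproduces the value stored in the fitted vectorizer's idf_ attribute:

```python
import numpy as np

n_docs = 4  # documents in the sample corpus
df = 3      # documents containing the term "document"

# sklearn's default smoothed IDF (smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + df)) + 1
print(idf)  # same value as tfidf_vectorizer.idf_ for "document"
```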
How to extract TF-IDF features using sklearn?

Use the TfidfVectorizer from sklearn to extract TF-IDF features from text data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer with custom parameters if needed
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform text data (text_data is a placeholder for your iterable of raw documents)
tfidf_matrix = tfidf_vectorizer.fit_transform(text_data)
```
How to access TF-IDF feature weights?

Retrieve the feature names from the fitted TfidfVectorizer and the weights from the transformed matrix.

```python
# Get feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Get TF-IDF feature weights
tfidf_weights = tfidf_matrix.toarray()

# Access TF-IDF weights for a specific document
document_index = 0
print("TF-IDF weights for document {}: {}".format(document_index, tfidf_weights[document_index]))
```
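Often only the highest-weighted terms per document matter. Here is a minimal sketch (reusing feature_names and tfidf_weights from above) that prints the top 3 terms for one document:

```python
import numpy as np

top_k = 3
# Indices of the largest weights, in descending order
top_indices = np.argsort(tfidf_weights[document_index])[::-1][:top_k]
for idx in top_indices:
    print(feature_names[idx], tfidf_weights[document_index][idx])
```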
Visualizing TF-IDF feature weights

```python
import matplotlib.pyplot as plt

# Plot TF-IDF feature weights for a specific document
plt.bar(range(len(feature_names)), tfidf_weights[document_index])
plt.xlabel('Feature Index')
plt.ylabel('TF-IDF Weight')
plt.title('TF-IDF Feature Weights for Document {}'.format(document_index))
plt.show()
```
Customizing TF-IDF parameters in sklearn

You can tune tokenization and vocabulary construction by passing parameters to TfidfVectorizer, such as stop-word removal, n-gram ranges, and a custom token pattern.

```python
# Initialize TfidfVectorizer with custom parameters
tfidf_vectorizer = TfidfVectorizer(stop_words='english',
                                   ngram_range=(1, 2),
                                   token_pattern=r'\b\w+\b')

# Fit and transform text data
tfidf_matrix = tfidf_vectorizer.fit_transform(text_data)
```
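To see the effect of these parameters, inspect the fitted vocabulary; with ngram_range=(1, 2) it contains both unigrams and bigrams (a small sketch reusing the vectorizer fit above):

```python
features = tfidf_vectorizer.get_feature_names_out()
print(len(features))   # vocabulary size after stop-word removal
print(features[:20])   # a mix of unigrams and bigrams
```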
TF-IDF feature extraction for large datasets

Note that TfidfVectorizer must see the whole corpus once to build its vocabulary and IDF statistics, so fit it a single time and then transform the data in batches; capping max_features keeps memory usage bounded.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Limit the vocabulary size to keep memory usage bounded
tfidf_vectorizer = TfidfVectorizer(max_features=10000)

# Fit once on the full corpus so vocabulary and IDF are consistent across batches
# (all_text_data and batches_of_text_data are placeholders for your corpus)
tfidf_vectorizer.fit(all_text_data)

# Transform text data in batches
for batch in batches_of_text_data:
    tfidf_matrix_partial = tfidf_vectorizer.transform(batch)
    # Further processing of partial TF-IDF matrix
```
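When the corpus is too large even for a single fitting pass, a stateless HashingVectorizer avoids storing a vocabulary entirely. A sketch of that alternative (note it yields hashed term counts rather than IDF-weighted features unless you combine it with a TfidfTransformer):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no fit step, fixed-size feature space, works batch by batch
hashing_vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

for batch in batches_of_text_data:
    X_partial = hashing_vectorizer.transform(batch)
    # Further processing of the partial matrix
```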
TF-IDF for document similarity

```python
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity between documents based on TF-IDF
similarity_matrix = cosine_similarity(tfidf_matrix)
```
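A short usage example: since similarity_matrix[i][j] is the cosine similarity between documents i and j, you can look up the nearest neighbor of any document (a sketch that excludes self-similarity):

```python
import numpy as np

query = 0
sims = similarity_matrix[query].copy()
sims[query] = -1.0  # ignore the document's similarity to itself
nearest = int(np.argmax(sims))
print("Document most similar to {}: {} (score {:.4f})".format(query, nearest, sims[nearest]))
```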
Applying TF-IDF in text classification

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Split data into training and testing sets
# (labels is the list of class labels aligned with the documents)
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, labels, test_size=0.2)

# Initialize and train classifier
classifier = SVC()
classifier.fit(X_train, y_train)

# Evaluate classifier
accuracy = classifier.score(X_test, y_test)
print("Accuracy:", accuracy)
```
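Splitting an already-vectorized matrix lets IDF statistics leak from the test documents into the features. A safer sketch wraps the vectorizer and classifier in a Pipeline so vectorization is fit on the training split only (documents and labels are assumed to be your raw texts and class labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# The vectorizer inside the pipeline is fit on the training texts only
pipeline = make_pipeline(TfidfVectorizer(), SVC())

X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.2)
pipeline.fit(X_train, y_train)
print("Accuracy:", pipeline.score(X_test, y_test))
```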
TF-IDF with pre-processing techniques

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize stemmer
stemmer = PorterStemmer()

# Custom preprocessor function
def custom_preprocessor(text):
    # Apply stemming to each whitespace-separated token
    stemmed_text = ' '.join([stemmer.stem(word) for word in text.split()])
    return stemmed_text

# Initialize TfidfVectorizer with custom preprocessor
tfidf_vectorizer = TfidfVectorizer(preprocessor=custom_preprocessor)

# Fit and transform text data
tfidf_matrix = tfidf_vectorizer.fit_transform(text_data)
```
TF-IDF with sublinear TF scaling

Sublinear scaling replaces the raw term frequency tf with 1 + log(tf), damping the influence of terms that repeat many times within a single document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer with sublinear TF scaling
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True)

# Fit and transform text data
tfidf_matrix = tfidf_vectorizer.fit_transform(text_data)
```
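A quick numeric illustration of the damping (1 + log(tf) is sklearn's documented behavior for sublinear_tf=True):

```python
import numpy as np

for tf in (1, 2, 10, 100):
    print(tf, "->", 1 + np.log(tf))  # e.g. 100 occurrences -> weight ~5.6, not 100
```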