Wednesday, May 6, 2015

Line-by-line check for a large number of keywords with Python

I am iterating through many CSV files of 1000 to 3000 lines each, checking for each line whether any of 70000 keywords is contained in a text of 140 characters. My problem at the moment is that my code runs extremely slowly, I guess because of the many iterations. I am a relatively new programmer and not sure what the best way to speed it up is. It took 2 hours to check one entire file, and there are still many more I need to go through. My logic at the moment is: import the CSV as a list of lists -> for each list in the list, take the first element and check for each of the 70000 keywords whether it is mentioned.
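To put the iteration count in perspective, here is a back-of-the-envelope sketch using only the figures above. It assumes the worst case, where no keyword matches a line and so every keyword is tried; the variable names are illustrative.

```python
# Worst-case number of regex searches for one file, using the figures above.
lines_per_file = 3000   # upper end of the 1000 to 3000 range
keywords = 70000
searches = lines_per_file * keywords

# Rough cost per search if one file takes about 2 hours.
seconds_per_file = 2 * 60 * 60
microseconds_per_search = seconds_per_file / searches * 1e6

print(searches)                         # 210000000
print(round(microseconds_per_search))   # 34
```

So each individual search is cheap; it is the roughly 210 million of them per file that add up.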

Currently my code looks like the following:

import re
import csv


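def findname(lst_names, text):
  # Return the first name in lst_names that appears as "@name" in text,
  # or None if no name is mentioned.
  for name in lst_names:
    name_match = re.search(r'@' + str(name), text)
    if name_match:
      return name
  return None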

lst_users = importusr_lst('users.csv')  # my own helper that imports the 70000 keywords


def find_mentions(path, lst_names):
  # Collect columns 1 to 6 plus the matched name for every row
  # whose text mentions one of the names.
  lst_successes = []
  with open(path, newline='') as csvfile:  # on Python 2: open(path, 'rb')
    filereader = csv.reader(csvfile, delimiter=',')
    content = list(filereader)

  if len(content) > 1:
    for row in content:
      mentioned = findname(lst_names, row[0])  # row[0] is the text of 140 characters

      if mentioned:
        hit = row[1:7]
        hit.append(mentioned)
        lst_successes.append(hit)

  return lst_successes
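
One direction that might cut the iterations, as a minimal sketch and not a tested solution: pull the @mentions out of each text with a single regex and look them up in a set, instead of running one search per keyword. This assumes the keywords are plain strings made of letters, digits and underscores. The names `MENTION_RE`, `find_name_fast` and the sample data are hypothetical. Note that the behaviour differs slightly from `findname`: it matches whole mentions only (so `@alicia` does not count as `alice`) and returns the first match in text order, not in keyword-list order.

```python
import re

# Assumption: a mention is "@" followed by letters, digits or underscores.
MENTION_RE = re.compile(r'@(\w+)')


def find_name_fast(names, text):
    """Return the first @mention in text that is in the set names, else None."""
    for candidate in MENTION_RE.findall(text):
        if candidate in names:  # set membership is O(1) on average
            return candidate
    return None


# Hypothetical sample data standing in for the 70000 keywords.
names = set(['alice', 'bob_42'])
print(find_name_fast(names, 'thanks @carol and @bob_42!'))  # bob_42
print(find_name_fast(names, 'no mention here'))             # None
```

With this shape, the work per line depends on the number of mentions in the text, not on the number of keywords.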

Thanks for any help!
