On Saturday 28 August 2010 11:48:10 Joseph Sinclair wrote:
> OK,
> I've attached a complete program that works, if you want to just get it
> done, but I've also described what went wrong in your first attempt below.
>

I really appreciate what you have done. I especially like the description of
what I did wrong. Using readlines() is a better approach like you said, with
less disk thrashing. I was using /usr/bin/python3, so print() is now a
function. My next step is to take the host list and identify where each IP is
located using pygeoip. Thank you again. :)
-Kevin

> # the i value was just for debugging, so I dropped it
> primaryfile = open('/tmp/extract','r')
> # read the primary file into a list for speed and so you aren't reading
> # more than once
> primary_lines = primaryfile.readlines()
> # you didn't specify a mode for this, so it defaulted to read-only. Be
> # explicit for clarity
> secondaryfile = open('/tmp/unload', 'r')
> # Open a separate file for output, otherwise you would have been writing
> # and reading the same file over and over again, which usually causes errors
> outputfile = open('/tmp/result-file', 'w')
> # read the second file into a list, then you can scan through it over and
> # over without hammering disk and re-reading a file you might have modified.
> secondary_lines = secondaryfile.readlines()
> # print is a statement, not a function.
> print 'opened files'
> # loop through the list, not the file
> for line in primary_lines:
>     pcompare = line
>     # print is a statement, use the formatting operator to print variable
>     # values
>     print 'primary line = %s' % (pcompare)
>     # loop through the list, not the file
>     for row in secondary_lines:
>         scompare = row
>         if pcompare == scompare:
>             # print as a statement, not a function
>             print 'secondary line = %s' % (scompare)
>             # you were writing random # characters in a file (most likely
>             # after the line read), this writes a comment to a new file,
>             # which is usually clearer.
>             # invert the test, and add the line to a set here then write out
>             # the set at the end to get an output of lines without duplication.
>             outputfile.write('#%s' % (scompare))
> print 'Done'
>
> Kevin Faulkner wrote:
> > Sorry about the time issue.
> >
> > On Friday 27 August 2010 23:50:00 you wrote:
> >> I hope these are small files, the algorithm you wrote is not going to
> >> run well as file size gets large (over 10,000 entries). Have you checked
> >> the space/tab situation? Python uses indentation changes to indicate
> >> the end of a block, so inconsistent use of tabs and spaces freaks it
> >> out. Here are a couple questions:
> >
> > This is not a school project, so you won't be doing my homework or
> > anything :) The space/tab issue is okay, but the script does not even
> > get to the print(i). I even tried "for line in secondaryfile:" and the for
> > loop still wouldn't be executed.
> >
> >> Are these always numbers?
> >
> > Yes, they are IPs from an Apache error log.
> >
> >> Do the files have to remain in their original order, or can you reorder
> >> them during processing? How often does this have to run?
> >
> > They are not in order because one list is 852 entries and another list is
> > 3300 entries. This script only needs to run once.
> >
> >> Do you have to "comment" the duplicate, or can you remove it?
> >
> > The plan is to remove it, but I wanted to see if my removal method would
> > work, so I was trying to put a comment next to it.
> >
> >> Are there any other requirements not obvious from the description below?
> >
> > No real requirements. If anyone would like the original files I can give
> > them to you; a lot of them are bots.
> > Thank you :)
> > -Kevin
> >
> >> Kevin Faulkner wrote:
> >>> I was trying to pull duplicates out of 2 different files. Needless to
> >>> say there are duplicates; I would place a # next to each duplicate.
> >>> Example files:
> >>> file 1:    file 2:
> >>> 433.3      947.3
> >>> 543.1      749.0
> >>> 741.1      859.2
> >>> 238.5      433.3
> >>> 839.2      229.1
> >>> 583.6      990.1
> >>> 863.4      741.1
> >>> 859.2      101.8
> >>>
> >>> import string
> >>> i=1
> >>> primaryfile = open('/tmp/extract','r')
> >>> secondaryfile = open('/tmp/unload')
> >>>
> >>> for line in primaryfile:
> >>>     pcompare = line
> >>>     print(pcompare)
> >>>
> >>>     for row in secondaryfile:
> >>>         i = i + 1
> >>>         print(i)
> >>>         scompare = row
> >>>
> >>>         if pcompare == scompare:
> >>>             print(scompare)
> >>>             secondaryfile.write('#')
> >>>
> >>> With this code it should go through the files and find a duplicate and
> >>> place a '#' next to it. But for some reason it doesn't even get to
> >>> the second for statement. I don't know what else to do. Please offer
> >>> some assistance. :)
> >>> ---------------------------------------------------
> >>> PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
> >>> To subscribe, unsubscribe, or to change your mail settings:
> >>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
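
For reference, here is a minimal Python 3 sketch of the set-based approach
suggested in the thread: build a set from the first list, then test each line
of the second list against it instead of looping over both lists. The function
names find_duplicates and remove_duplicates are hypothetical, and the sample
data is copied from the example files in the original message. Real input
would presumably come from readlines() on /tmp/extract and /tmp/unload.

```python
def find_duplicates(primary_lines, secondary_lines):
    """Return secondary entries that also appear in primary (in order, once each)."""
    # Strip newlines so a missing trailing newline cannot hide a match.
    primary = {line.strip() for line in primary_lines if line.strip()}
    seen = set()
    duplicates = []
    for line in secondary_lines:
        entry = line.strip()
        if entry in primary and entry not in seen:
            seen.add(entry)
            duplicates.append(entry)
    return duplicates


def remove_duplicates(primary_lines, secondary_lines):
    """Return secondary entries that do NOT appear in primary (the inverted test)."""
    primary = {line.strip() for line in primary_lines if line.strip()}
    return [line.strip() for line in secondary_lines
            if line.strip() and line.strip() not in primary]


# Sample data from the example files in the original message.
file1 = ["433.3\n", "543.1\n", "741.1\n", "238.5\n",
         "839.2\n", "583.6\n", "863.4\n", "859.2\n"]
file2 = ["947.3\n", "749.0\n", "859.2\n", "433.3\n",
         "229.1\n", "990.1\n", "741.1\n", "101.8\n"]

# Mark each duplicate with a leading '#', as the original script intended.
for entry in find_duplicates(file1, file2):
    print("#" + entry)
```

Set membership tests take roughly constant time, so this avoids rescanning
the 3300-entry list once for every one of the 852 entries in the other list.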