Optimize this!

 
Dazzy
Agent

Joined: 09 Jan 2004
Posts: 1731
Reputation: 72.3

Posted: Sat Mar 26, 2005 2:50 pm

This opens a file and compares each word against every other word to find the duplicates and get rid of them, then prints the result to a new file...
This is probably really bad code. I can't think of any other methods or where to look, but I'm sure they exist... I'm using really long files, so the run time isn't great at the moment: it takes up to half an hour! That is 174808 words, though.

Here's the code:
Code:
print "Ok, starting comparison!";

open (THING, "<", "words.txt") or die "Cannot open words.txt: $!";
my @words = <THING>;
close THING;
chomp @words;

my $count = scalar(@words);
my @second;   # unique words found so far
my $add;
foreach my $word (@words){
   $add = "yes";
   # Scans every word kept so far; this nested loop is what makes the script slow
   foreach my $sec (@second){
      if($word eq $sec){
         print "\nFound a duplicate! $word";
         $add = "no";
      }
   }
   if($add eq "yes"){
      unshift(@second,$word);
   }
}

@second = sort @second;   # string sort; <=> compares numerically and does not order words
my $count2 = scalar(@second);
my $finished = join("\n",@second);
open (THING2, ">", "words2.txt") or die "Cannot open words2.txt: $!";   # ">" opens for writing
print THING2 $finished;
close THING2;
my $duplicates = $count - $count2;
print "\nOk, Finished! $count2 words were written!\nWe found " . $duplicates . " duplicate words...";

Siebe
God Like

Joined: 06 Jan 2004
Posts: 562
Location: Netherlands
Reputation: 39.8

Posted: Sat Mar 26, 2005 4:40 pm

Code:
#!/usr/bin/perl -w
use strict;

sub removeDupes {
    my $string = shift;
    my $copy = ""; # Copy we are going to use later on

    # First append and prefix with spaces for our regular expressions
    $string = " $string ";

    # Then perform a loop and remove all duplicates
    do {
        # Replace and backup the copy
        $copy = $string; # <-- For while() clause
        # (?<=\s) forces the repeat to start at a word boundary, so "one" is
        # not cut from the end of a longer word such as "someone"
        $string =~ s/\s(\w+\s)(.*)(?<=\s)\1(.*)/ $1$2$3/g;

    # Test if the lengths are equal, if so, we performed no operations and should quit
    } while(length($copy) != length($string));

    # Remove the prefixing and appending spaces
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;

    return $string; # Done! :-)
}

print removeDupes("one one two two three four four five six six one two three four five six six") . "\n";

Cer
Upgraded Agent

Joined: 03 Feb 2004
Posts: 3776
Location: Michigan
Reputation: 146.9
votes: 4

Posted: Sat Mar 26, 2005 5:45 pm

You can also use Sort::Array to do this.

Code:
use Sort::Array qw(Discard_Duplicates);   # import the function by name

my @new = Discard_Duplicates (data => \@original);


http://search.cpan.org/~midi/Sort-Array-0.26/Array.pm
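
If installing a CPAN module is not an option, a similar result can be sketched with the core List::Util module instead. This is a swapped-in technique, not the Sort::Array API: it assumes a Perl recent enough to ship List::Util 1.45 or later, and the sample contents of @original are made up for illustration.

```perl
use strict;
use warnings;
use List::Util 1.45 qw(uniq);   # uniq was added in List::Util 1.45

# Sample data (an assumption for this sketch)
my @original = qw(pear apple pear fig apple);

# uniq keeps the first occurrence of each value, in the original order
my @unique = uniq(@original);

# Sort in a separate step to get the same ordering as the other examples
my @new = sort @unique;

print "@new\n";
```

The sort is kept as a separate statement on purpose: writing sort directly in front of a function call can make Perl treat the function name as the comparison routine.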

_________________
Current Site (2008) http://www.cuvou.com/

Mojave
Almost An Agent

Joined: 01 Nov 2003
Posts: 1434
Reputation: 66.4

Posted: Sat Mar 26, 2005 6:56 pm

I like using a hash, not sure how efficient it is though:

Code:
print "Ok, starting comparison!";

open (THING, "<", "words.txt") or die "Cannot open words.txt: $!";
my @words = <THING>;
close THING;
chomp @words;
my $count = scalar @words;

# Duplicate words collapse onto the same hash key
my %hash = map { $_, 1 } @words;

my $count2 = scalar keys %hash;

open (THING2, ">", "words2.txt") or die "Cannot open words2.txt: $!";   # ">" opens for writing
print THING2 join( "\n", sort keys %hash );
close THING2;
my $duplicates = $count - $count2;
print "\nOk, Finished! $count2 words were written!\nWe found " . $duplicates . " duplicate words...";
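
On the efficiency question: a hash lookup takes roughly constant time on average, so the hash version makes one pass over the list, while the nested loop in the first post can need n(n-1)/2 comparisons, which is roughly 15 billion for 174808 distinct words. Below is a minimal sketch of the same hash idea that also keeps the words in their original order rather than sorting them. The sub name unique_in_order, the %seen hash and the sample word list are assumptions made for illustration.

```perl
use strict;
use warnings;

# Return the unique items of a list, keeping the first occurrence of each
# and preserving the original order. (Hypothetical helper name.)
sub unique_in_order {
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a word appears, true afterwards
    return grep { !$seen{$_}++ } @_;
}

# Sample data (an assumption for this sketch)
my @words  = qw(one one two two three four four five six six one two);
my @unique = unique_in_order(@words);

print join("\n", @unique), "\n";
printf "%d words kept, %d duplicates dropped\n",
    scalar @unique, @words - @unique;
```

Add a sort around the result if sorted output is wanted, as in the code above.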

brother
Senior Member

Joined: 06 Aug 2004
Posts: 156
Location: Belgium
Reputation: 24.5

Posted: Sat Mar 26, 2005 9:40 pm

Here is my (short) way of doing it, very efficient and using only the original array holding the lines of the file and one temporary hash...

@words contains our list of words...

Code:
my %done = undef;
@done{@words} = ();
@words = sort keys %done;

Mojave
Almost An Agent

Joined: 01 Nov 2003
Posts: 1434
Reputation: 66.4

Posted: Sat Mar 26, 2005 10:35 pm

Um, did you try that code out, brother? It doesn't work. But you're doing essentially what I was doing, turning the list into a hash for easy merging. And mine has the added benefit of working.

EDIT: On second try, I found your method does work, it just doesn't compile under strict and warnings, which I always use. In any case, my method is shorter, one line instead of two! hehe
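
One possible source of the complaints, not confirmed anywhere in the thread, is the `my %done = undef;` line in brother's snippet. Assigning undef to a hash assigns a one-element list, which triggers warnings at run time and leaves an empty-string key in the hash. The sketch below captures the warnings and compares the key counts; the sample words and the %clean and @warnings names are assumptions for illustration.

```perl
use strict;
use warnings;

# Sample data (an assumption for this sketch)
my @words = qw(pear apple pear fig apple);

# Collect warnings instead of printing them, so they can be inspected
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my %done = undef;      # one-element list: creates the key "" with value undef
@done{@words} = ();    # hash slice: one key per distinct word

my %clean;             # plain declaration, no assignment
@clean{@words} = ();

print scalar(keys %done),  " keys with '= undef'\n";
print scalar(keys %clean), " keys with a plain 'my %clean;'\n";
print "warning: $_" for @warnings;
```

Declaring the hash with a plain `my %done;` avoids both the warnings and the stray empty key, which would otherwise sort to the top of the output as a blank line.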

brother
Senior Member

Joined: 06 Aug 2004
Posts: 156
Location: Belgium
Reputation: 24.5

Posted: Sat Mar 26, 2005 10:45 pm

Hmm, weird, I use strict and warnings too... It never threw up on me before. I'll try the map thing though.

EDIT: It does EXACTLY the same thing... even sorting the keys at the end (merged into the writing code).

Again, thanks for the map suggestion. I'll see if I can update some of my code and run some benchmarks; they will probably show that using 'map' is more efficient.
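
A benchmark along those lines could be sketched with the core Benchmark module. The input here is synthetic and the sub names dedupe_map and dedupe_slice are made up for illustration. The thread reports no timings, and which variant comes out ahead may depend on the Perl version and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic input (an assumption): 20,000 words drawn from 5,000 distinct values
my @words = map { "word" . int(rand 5000) } 1 .. 20_000;

# Mojave's approach: build key/value pairs with map
sub dedupe_map {
    my %hash = map { $_, 1 } @_;
    my @sorted = sort keys %hash;
    return @sorted;
}

# brother's approach: hash slice (with a plain declaration of the hash)
sub dedupe_slice {
    my %done;
    @done{@_} = ();
    my @sorted = sort keys %done;
    return @sorted;
}

# Run each variant for at least one CPU second and print a comparison table
cmpthese(-1, {
    'map_pairs'  => sub { my @r = dedupe_map(@words) },
    'hash_slice' => sub { my @r = dedupe_slice(@words) },
});
```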

Dazzy
Agent

Joined: 09 Jan 2004
Posts: 1731
Reputation: 72.3

Posted: Sat Mar 26, 2005 11:48 pm

Thanks everyone. It was only a five-minute thing to sort out some files, but I thought it might spark some interest here, which it obviously has, so hey, that's two birds with one stone.
All times are GMT